The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Internal identifier: 000592 (Pmc/Corpus); previous: 000591; next: 000593

Authors: Shibu Yooseph ; Granger Sutton ; Douglas B. Rusch ; Aaron L. Halpern ; Shannon J. Williamson ; Karin Remington ; Jonathan A. Eisen ; Karla B. Heidelberg ; Gerard Manning ; Weizhong Li ; Lukasz Jaroszewski ; Piotr Cieplak ; Christopher S. Miller ; Huiying Li ; Susan T. Mashiyama ; Marcin P. Joachimiak ; Christopher Van Belle ; John-Marc Chandonia ; David A. Soergel ; Yufeng Zhai ; Kannan Natarajan ; Shaun Lee ; Benjamin J. Raphael ; Vineet Bafna ; Robert Friedman ; Steven E. Brenner ; Adam Godzik ; David Eisenberg ; Jack E. Dixon ; Susan S. Taylor ; Robert L. Strausberg ; Marvin Frazier ; J. Craig Venter

Source:

PLoS Biology [ 1544-9173 ] ; 2007.

RBID : PMC:1821046

Abstract

Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1821046

DOI: 10.1371/journal.pbio.0050016
PubMed: 17355171
PubMed Central: 1821046

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">The <italic>Sorcerer II</italic>
 Global Ocean Sampling Expedition: Expanding the Universe of Protein Families</title>
<author><name sortKey="Yooseph, Shibu" sort="Yooseph, Shibu" uniqKey="Yooseph S" first="Shibu" last="Yooseph">Shibu Yooseph</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Sutton, Granger" sort="Sutton, Granger" uniqKey="Sutton G" first="Granger" last="Sutton">Granger Sutton</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Rusch, Douglas B" sort="Rusch, Douglas B" uniqKey="Rusch D" first="Douglas B" last="Rusch">Douglas B. Rusch</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Halpern, Aaron L" sort="Halpern, Aaron L" uniqKey="Halpern A" first="Aaron L" last="Halpern">Aaron L. Halpern</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Williamson, Shannon J" sort="Williamson, Shannon J" uniqKey="Williamson S" first="Shannon J" last="Williamson">Shannon J. Williamson</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Remington, Karin" sort="Remington, Karin" uniqKey="Remington K" first="Karin" last="Remington">Karin Remington</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Eisen, Jonathan A" sort="Eisen, Jonathan A" uniqKey="Eisen J" first="Jonathan A" last="Eisen">Jonathan A. Eisen</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff2"> University of California, Davis, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Heidelberg, Karla B" sort="Heidelberg, Karla B" uniqKey="Heidelberg K" first="Karla B" last="Heidelberg">Karla B. Heidelberg</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Manning, Gerard" sort="Manning, Gerard" uniqKey="Manning G" first="Gerard" last="Manning">Gerard Manning</name>
<affiliation><nlm:aff id="aff3"> Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Li, Weizhong" sort="Li, Weizhong" uniqKey="Li W" first="Weizhong" last="Li">Weizhong Li</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Jaroszewski, Lukasz" sort="Jaroszewski, Lukasz" uniqKey="Jaroszewski L" first="Lukasz" last="Jaroszewski">Lukasz Jaroszewski</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Cieplak, Piotr" sort="Cieplak, Piotr" uniqKey="Cieplak P" first="Piotr" last="Cieplak">Piotr Cieplak</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Miller, Christopher S" sort="Miller, Christopher S" uniqKey="Miller C" first="Christopher S" last="Miller">Christopher S. Miller</name>
<affiliation><nlm:aff id="aff5"> University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Li, Huiying" sort="Li, Huiying" uniqKey="Li H" first="Huiying" last="Li">Huiying Li</name>
<affiliation><nlm:aff id="aff5"> University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Mashiyama, Susan T" sort="Mashiyama, Susan T" uniqKey="Mashiyama S" first="Susan T" last="Mashiyama">Susan T. Mashiyama</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Joachimiak, Marcin P" sort="Joachimiak, Marcin P" uniqKey="Joachimiak M" first="Marcin P" last="Joachimiak">Marcin P. Joachimiak</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Van Belle, Christopher" sort="Van Belle, Christopher" uniqKey="Van Belle C" first="Christopher" last="Van Belle">Christopher Van Belle</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Chandonia, John Marc" sort="Chandonia, John Marc" uniqKey="Chandonia J" first="John-Marc" last="Chandonia">John-Marc Chandonia</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff7"> Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Soergel, David A" sort="Soergel, David A" uniqKey="Soergel D" first="David A" last="Soergel">David A. Soergel</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Zhai, Yufeng" sort="Zhai, Yufeng" uniqKey="Zhai Y" first="Yufeng" last="Zhai">Yufeng Zhai</name>
<affiliation><nlm:aff id="aff3"> Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Natarajan, Kannan" sort="Natarajan, Kannan" uniqKey="Natarajan K" first="Kannan" last="Natarajan">Kannan Natarajan</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Lee, Shaun" sort="Lee, Shaun" uniqKey="Lee S" first="Shaun" last="Lee">Shaun Lee</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Raphael, Benjamin J" sort="Raphael, Benjamin J" uniqKey="Raphael B" first="Benjamin J" last="Raphael">Benjamin J. Raphael</name>
<affiliation><nlm:aff id="aff9"> Brown University, Providence, Rhode Island, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Bafna, Vineet" sort="Bafna, Vineet" uniqKey="Bafna V" first="Vineet" last="Bafna">Vineet Bafna</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Friedman, Robert" sort="Friedman, Robert" uniqKey="Friedman R" first="Robert" last="Friedman">Robert Friedman</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Brenner, Steven E" sort="Brenner, Steven E" uniqKey="Brenner S" first="Steven E" last="Brenner">Steven E. Brenner</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Godzik, Adam" sort="Godzik, Adam" uniqKey="Godzik A" first="Adam" last="Godzik">Adam Godzik</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Eisenberg, David" sort="Eisenberg, David" uniqKey="Eisenberg D" first="David" last="Eisenberg">David Eisenberg</name>
<affiliation><nlm:aff id="aff5"> University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Dixon, Jack E" sort="Dixon, Jack E" uniqKey="Dixon J" first="Jack E" last="Dixon">Jack E. Dixon</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Taylor, Susan S" sort="Taylor, Susan S" uniqKey="Taylor S" first="Susan S" last="Taylor">Susan S. Taylor</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Strausberg, Robert L" sort="Strausberg, Robert L" uniqKey="Strausberg R" first="Robert L" last="Strausberg">Robert L. Strausberg</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Frazier, Marvin" sort="Frazier, Marvin" uniqKey="Frazier M" first="Marvin" last="Frazier">Marvin Frazier</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Venter, J Craig" sort="Venter, J Craig" uniqKey="Venter J" first="J. Craig" last="Venter">J. Craig Venter</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">17355171</idno>
<idno type="pmc">1821046</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1821046</idno>
<idno type="RBID">PMC:1821046</idno>
<idno type="doi">10.1371/journal.pbio.0050016</idno>
<date when="2007">2007</date>
<idno type="wicri:Area/Pmc/Corpus">000592</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">The <italic>Sorcerer II</italic>
 Global Ocean Sampling Expedition: Expanding the Universe of Protein Families</title>
<author><name sortKey="Yooseph, Shibu" sort="Yooseph, Shibu" uniqKey="Yooseph S" first="Shibu" last="Yooseph">Shibu Yooseph</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Sutton, Granger" sort="Sutton, Granger" uniqKey="Sutton G" first="Granger" last="Sutton">Granger Sutton</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Rusch, Douglas B" sort="Rusch, Douglas B" uniqKey="Rusch D" first="Douglas B" last="Rusch">Douglas B. Rusch</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Halpern, Aaron L" sort="Halpern, Aaron L" uniqKey="Halpern A" first="Aaron L" last="Halpern">Aaron L. Halpern</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Williamson, Shannon J" sort="Williamson, Shannon J" uniqKey="Williamson S" first="Shannon J" last="Williamson">Shannon J. Williamson</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Remington, Karin" sort="Remington, Karin" uniqKey="Remington K" first="Karin" last="Remington">Karin Remington</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Eisen, Jonathan A" sort="Eisen, Jonathan A" uniqKey="Eisen J" first="Jonathan A" last="Eisen">Jonathan A. Eisen</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff2"> University of California, Davis, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Heidelberg, Karla B" sort="Heidelberg, Karla B" uniqKey="Heidelberg K" first="Karla B" last="Heidelberg">Karla B. Heidelberg</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Manning, Gerard" sort="Manning, Gerard" uniqKey="Manning G" first="Gerard" last="Manning">Gerard Manning</name>
<affiliation><nlm:aff id="aff3"> Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Li, Weizhong" sort="Li, Weizhong" uniqKey="Li W" first="Weizhong" last="Li">Weizhong Li</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Jaroszewski, Lukasz" sort="Jaroszewski, Lukasz" uniqKey="Jaroszewski L" first="Lukasz" last="Jaroszewski">Lukasz Jaroszewski</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Cieplak, Piotr" sort="Cieplak, Piotr" uniqKey="Cieplak P" first="Piotr" last="Cieplak">Piotr Cieplak</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Miller, Christopher S" sort="Miller, Christopher S" uniqKey="Miller C" first="Christopher S" last="Miller">Christopher S. Miller</name>
<affiliation><nlm:aff id="aff5"> University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Li, Huiying" sort="Li, Huiying" uniqKey="Li H" first="Huiying" last="Li">Huiying Li</name>
<affiliation><nlm:aff id="aff5"> University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Mashiyama, Susan T" sort="Mashiyama, Susan T" uniqKey="Mashiyama S" first="Susan T" last="Mashiyama">Susan T. Mashiyama</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Joachimiak, Marcin P" sort="Joachimiak, Marcin P" uniqKey="Joachimiak M" first="Marcin P" last="Joachimiak">Marcin P. Joachimiak</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Van Belle, Christopher" sort="Van Belle, Christopher" uniqKey="Van Belle C" first="Christopher" last="Van Belle">Christopher Van Belle</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Chandonia, John Marc" sort="Chandonia, John Marc" uniqKey="Chandonia J" first="John-Marc" last="Chandonia">John-Marc Chandonia</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="aff7"> Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Soergel, David A" sort="Soergel, David A" uniqKey="Soergel D" first="David A" last="Soergel">David A. Soergel</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Zhai, Yufeng" sort="Zhai, Yufeng" uniqKey="Zhai Y" first="Yufeng" last="Zhai">Yufeng Zhai</name>
<affiliation><nlm:aff id="aff3"> Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Natarajan, Kannan" sort="Natarajan, Kannan" uniqKey="Natarajan K" first="Kannan" last="Natarajan">Kannan Natarajan</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Lee, Shaun" sort="Lee, Shaun" uniqKey="Lee S" first="Shaun" last="Lee">Shaun Lee</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Raphael, Benjamin J" sort="Raphael, Benjamin J" uniqKey="Raphael B" first="Benjamin J" last="Raphael">Benjamin J. Raphael</name>
<affiliation><nlm:aff id="aff9"> Brown University, Providence, Rhode Island, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Bafna, Vineet" sort="Bafna, Vineet" uniqKey="Bafna V" first="Vineet" last="Bafna">Vineet Bafna</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Friedman, Robert" sort="Friedman, Robert" uniqKey="Friedman R" first="Robert" last="Friedman">Robert Friedman</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Brenner, Steven E" sort="Brenner, Steven E" uniqKey="Brenner S" first="Steven E" last="Brenner">Steven E. Brenner</name>
<affiliation><nlm:aff id="aff6"> University of California Berkeley, Berkeley, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Godzik, Adam" sort="Godzik, Adam" uniqKey="Godzik A" first="Adam" last="Godzik">Adam Godzik</name>
<affiliation><nlm:aff id="aff4"> Burnham Institute for Medical Research, La Jolla, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Eisenberg, David" sort="Eisenberg, David" uniqKey="Eisenberg D" first="David" last="Eisenberg">David Eisenberg</name>
<affiliation><nlm:aff id="aff5"> University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Dixon, Jack E" sort="Dixon, Jack E" uniqKey="Dixon J" first="Jack E" last="Dixon">Jack E. Dixon</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Taylor, Susan S" sort="Taylor, Susan S" uniqKey="Taylor S" first="Susan S" last="Taylor">Susan S. Taylor</name>
<affiliation><nlm:aff id="aff8"> University of California San Diego, San Diego, California, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Strausberg, Robert L" sort="Strausberg, Robert L" uniqKey="Strausberg R" first="Robert L" last="Strausberg">Robert L. Strausberg</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Frazier, Marvin" sort="Frazier, Marvin" uniqKey="Frazier M" first="Marvin" last="Frazier">Marvin Frazier</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Venter, J Craig" sort="Venter, J Craig" uniqKey="Venter J" first="J. Craig" last="Venter">J. Craig Venter</name>
<affiliation><nlm:aff id="aff1"> J. Craig Venter Institute, Rockville, Maryland, United States of America</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS Biology</title>
<idno type="ISSN">1544-9173</idno>
<idno type="eISSN">1545-7885</idno>
<imprint><date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.</p>
</div>
</front>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">PLoS Biol</journal-id>
<journal-id journal-id-type="publisher-id">pbio</journal-id>
<journal-title>PLoS Biology</journal-title>
<issn pub-type="ppub">1544-9173</issn>
<issn pub-type="epub">1545-7885</issn>
<publisher><publisher-name>Public Library of Science</publisher-name>
 <publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">17355171</article-id>
<article-id pub-id-type="pmc">1821046</article-id>
<article-id pub-id-type="doi">10.1371/journal.pbio.0050016</article-id>
<article-id pub-id-type="publisher-id">06-PLBI-RA-0500R3</article-id>
<article-id pub-id-type="sici">plbi-05-03-23</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline"><subject>Computational Biology</subject>
<subject>Evolutionary Biology</subject>
<subject>Genetics and Genomics</subject>
<subject>Molecular Biology</subject>
</subj-group>
<subj-group subj-group-type="System Taxonomy"><subject>Eubacteria</subject>
<subject>Viruses</subject>
</subj-group>
<series-title>Oceanic Metagenomics</series-title>
</article-categories>
<title-group><article-title>The <italic>Sorcerer II</italic>
 Global Ocean Sampling Expedition: Expanding the Universe of Protein Families</article-title>
<alt-title alt-title-type="running-head">Expanding the Protein Family Universe</alt-title>
</title-group>
<contrib-group><contrib contrib-type="author"><name><surname>Yooseph</surname>
<given-names>Shibu</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Sutton</surname>
<given-names>Granger</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Rusch</surname>
<given-names>Douglas B</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Halpern</surname>
<given-names>Aaron L</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Williamson</surname>
<given-names>Shannon J</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Remington</surname>
<given-names>Karin</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Eisen</surname>
<given-names>Jonathan A</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
<xref rid="aff2" ref-type="aff">2</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Heidelberg</surname>
<given-names>Karla B</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Manning</surname>
<given-names>Gerard</given-names>
</name>
<xref rid="aff3" ref-type="aff">3</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Li</surname>
<given-names>Weizhong</given-names>
</name>
<xref rid="aff4" ref-type="aff">4</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Jaroszewski</surname>
<given-names>Lukasz</given-names>
</name>
<xref rid="aff4" ref-type="aff">4</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Cieplak</surname>
<given-names>Piotr</given-names>
</name>
<xref rid="aff4" ref-type="aff">4</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Miller</surname>
<given-names>Christopher S</given-names>
</name>
<xref rid="aff5" ref-type="aff">5</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Li</surname>
<given-names>Huiying</given-names>
</name>
<xref rid="aff5" ref-type="aff">5</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Mashiyama</surname>
<given-names>Susan T</given-names>
</name>
<xref rid="aff6" ref-type="aff">6</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Joachimiak</surname>
<given-names>Marcin P</given-names>
</name>
<xref rid="aff6" ref-type="aff">6</xref>
</contrib>
<contrib contrib-type="author"><name><surname>van Belle</surname>
<given-names>Christopher</given-names>
</name>
<xref rid="aff6" ref-type="aff">6</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Chandonia</surname>
<given-names>John-Marc</given-names>
</name>
<xref rid="aff6" ref-type="aff">6</xref>
<xref rid="aff7" ref-type="aff">7</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Soergel</surname>
<given-names>David A</given-names>
</name>
<xref rid="aff6" ref-type="aff">6</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Zhai</surname>
<given-names>Yufeng</given-names>
</name>
<xref rid="aff3" ref-type="aff">3</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Natarajan</surname>
<given-names>Kannan</given-names>
</name>
<xref rid="aff8" ref-type="aff">8</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Lee</surname>
<given-names>Shaun</given-names>
</name>
<xref rid="aff8" ref-type="aff">8</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Raphael</surname>
<given-names>Benjamin J</given-names>
</name>
<xref rid="aff9" ref-type="aff">9</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Bafna</surname>
<given-names>Vineet</given-names>
</name>
<xref rid="aff8" ref-type="aff">8</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Friedman</surname>
<given-names>Robert</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Brenner</surname>
<given-names>Steven E</given-names>
</name>
<xref rid="aff6" ref-type="aff">6</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Godzik</surname>
<given-names>Adam</given-names>
</name>
<xref rid="aff4" ref-type="aff">4</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Eisenberg</surname>
<given-names>David</given-names>
</name>
<xref rid="aff5" ref-type="aff">5</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Dixon</surname>
<given-names>Jack E</given-names>
</name>
<xref rid="aff8" ref-type="aff">8</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Taylor</surname>
<given-names>Susan S</given-names>
</name>
<xref rid="aff8" ref-type="aff">8</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Strausberg</surname>
<given-names>Robert L</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Frazier</surname>
<given-names>Marvin</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
<contrib contrib-type="author"><name><surname>Venter</surname>
<given-names>J. Craig</given-names>
</name>
<xref rid="aff1" ref-type="aff">1</xref>
</contrib>
</contrib-group>
<aff id="aff1"><label>1</label>
 J. Craig Venter Institute, Rockville, Maryland, United States of America</aff>
<aff id="aff2"><label>2</label>
 University of California, Davis, California, United States of America</aff>
<aff id="aff3"><label>3</label>
 Razavi-Newman Center for Bioinformatics, Salk Institute for Biological Studies, La Jolla, California, United States of America</aff>
<aff id="aff4"><label>4</label>
 Burnham Institute for Medical Research, La Jolla, California, United States of America</aff>
<aff id="aff5"><label>5</label>
 University of California Los Angeles–Department of Energy Institute for Genomics and Proteomics, Los Angeles, California, United States of America</aff>
<aff id="aff6"><label>6</label>
 University of California Berkeley, Berkeley, California, United States of America</aff>
<aff id="aff7"><label>7</label>
 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America</aff>
<aff id="aff8"><label>8</label>
 University of California San Diego, San Diego, California, United States of America</aff>
<aff id="aff9"><label>9</label>
 Brown University, Providence, Rhode Island, United States of America</aff>
<contrib-group><contrib contrib-type="editor"><name><surname>Eddy</surname>
<given-names>Sean</given-names>
</name>
<role>Academic Editor</role>
<xref rid="edit1" ref-type="aff"></xref>
</contrib>
</contrib-group>
<aff id="edit1">Washington University St. Louis, United States of America</aff>
<author-notes><corresp id="cor1">* To whom correspondence should be addressed. E-mail: <email>Shibu.Yooseph@venterinstitute.org</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub"><month>3</month>
<year>2007</year>
</pub-date>
<pub-date pub-type="epub"><day>13</day>
<month>3</month>
<year>2007</year>
</pub-date>
<volume>5</volume>
<issue>3</issue>
<elocation-id>e16</elocation-id>
<history><date date-type="received"><day>24</day>
<month>3</month>
<year>2006</year>
</date>
<date date-type="accepted"><day>15</day>
<month>8</month>
<year>2006</year>
</date>
</history>
<copyright-statement><bold>Copyright:</bold>
 © 2007 Yooseph et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</copyright-statement>
<copyright-year>2007</copyright-year>
<related-article xlink:href="10.1371/journal.pbio.0050085" related-article-type="companion" xlink:title="Synopsis" vol="5" page="e85" id="N0x8eb5648N0x8eaa8a8" ext-link-type="doi"><article-title>Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity</article-title>
</related-article>
<related-article xlink:href="10.1371/journal.pbio.0050082" related-article-type="companion" xlink:title="Essay" vol="5" page="e82" id="N0x8eb5648N0x8eaa8d8" ext-link-type="doi"><article-title>Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes</article-title>
</related-article>
<related-article xlink:href="10.1371/journal.pbio.0050075" related-article-type="companion" xlink:title="Community Page" vol="5" page="e75" id="N0x8eb5648N0x8eaa908" ext-link-type="doi"><article-title>CAMERA: A Community Resource for Metagenomics</article-title>
</related-article>
<related-article xlink:href="10.1371/journal.pbio.0050017" related-article-type="companion" xlink:title="Research Article" vol="5" page="e17" id="N0x8eb5648N0x8eaa938" ext-link-type="doi"><article-title>Structural and Functional Diversity of the Microbial Kinome</article-title>
</related-article>
<related-article xlink:href="10.1371/journal.pbio.0050083" related-article-type="companion" xlink:title="Editorial" vol="5" page="e83" id="N0x8eb5648N0x8eaa968" ext-link-type="doi"><article-title>Global Ocean Sampling Collection</article-title>
</related-article>
<related-article xlink:href="10.1371/journal.pbio.0050077" related-article-type="companion" xlink:title="Research Article" vol="5" page="e77" id="N0x8eb5648N0x8eaa998" ext-link-type="doi"><article-title>The <italic>Sorcerer II</italic>
 Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific</article-title>
</related-article>
<related-article xlink:href="10.1371/journal.pbio.0050074" related-article-type="companion" xlink:title="Feature" vol="5" page="e74" id="N0x8eb5648N0x8eaa9c8" ext-link-type="doi"><article-title><italic>Sorcerer II:</italic>
 The Search for Microbial Diversity Roils the Waters</article-title>
</related-article>
<abstract><p>Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.</p>
</abstract>
<abstract abstract-type="summary"><title>Author Summary</title>
<sec id="st1"><title></title>
<p>The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes will not grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. One such technique—shotgun sequencing—allows random sampling of DNA sequences to examine the genomic material present in a microbial community. We used shotgun sequencing to examine microbial communities in water samples collected by the <italic>Sorcerer II</italic>
 Global Ocean Sampling (GOS) expedition. Our analysis predicted more than six million proteins in the GOS data—nearly twice the number of proteins present in current databases. These predictions add tremendous diversity to known protein families and cover nearly all known prokaryotic protein families. Some of the predicted proteins had no similarity to any currently known proteins and therefore represent new families. A higher than expected fraction of these novel families is predicted to be of viral origin. We also found that several protein domains that were previously thought to be kingdom specific have GOS examples in other kingdoms. Our analysis opens the door for a multitude of follow-up protein family analyses and indicates that we are a long way from sampling all the protein families that exist in nature.</p>
</sec>
</abstract>
<abstract abstract-type="toc"><p>The GOS data identified 6.12 million predicted proteins covering nearly all known prokaryotic protein families, and several new families. This almost doubles the number of known proteins and shows that we are far from identifying all the proteins in nature.</p>
</abstract>
<counts><page-count count="35"></page-count>
</counts>
<custom-meta-wrap><custom-meta><meta-name>citation</meta-name>
<meta-value>Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. (2007) The <italic>Sorcerer II</italic>
 Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol 5(3): e16. doi:<ext-link ext-link-type="doi" xlink:href="10.1371/journal.pbio.0050016">10.1371/journal.pbio.0050016</ext-link>
</meta-value>
</custom-meta>
<custom-meta><meta-name>article-logo</meta-name>
<meta-value>oceaniclogo.jpg</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body><sec id="s1"><title>Introduction</title>
<p>Despite many efforts to classify and organize proteins [<xref ref-type="bibr" rid="pbio-0050016-b001">1</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b006">6</xref>
] from both structural and functional perspectives, we are far from a clear understanding of the size and diversity of the protein universe [<xref ref-type="bibr" rid="pbio-0050016-b007">7</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b009">9</xref>
]. Environmental shotgun sequencing projects, in which genetic sequences are sampled from communities of microorganisms [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b014">14</xref>
], are poised to make a dramatic impact on our understanding of proteins and protein families. These studies are not limited to culturable organisms, and there are no selection biases for protein classes or organisms. These studies typically provide a gene-centric (as opposed to an organism-centric) view of the environment and allow the examination of questions related to protein family evolution and diversity. The protein predictions from some of these studies are characterized both by their sheer number and diversity. For instance, the recent Sargasso Sea study [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
] resulted in 1.2 million protein predictions and identified new subfamilies for several known protein families.</p>
<p>Protein exploration starts by clustering proteins into groups or <italic>families</italic>
 of evolutionarily related sequences. The notion of a protein family, while biologically very relevant, is hard to realize precisely in mathematical terms, thereby making the large-scale computational clustering and classification problem nontrivial. Techniques for these problems typically rely on <italic>sequence similarity</italic>
 to group sequences. Proteins can be grouped into families based on the highly conserved structural units, called <italic>domains,</italic>
 that they contain [<xref ref-type="bibr" rid="pbio-0050016-b015">15</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b016">16</xref>
]. Alternatively, proteins are grouped into families based on their full sequence [<xref ref-type="bibr" rid="pbio-0050016-b017">17</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b018">18</xref>
]. Many of these classifications, together with various expert-curated databases [<xref ref-type="bibr" rid="pbio-0050016-b019">19</xref>
] such as Swiss-Prot [<xref ref-type="bibr" rid="pbio-0050016-b020">20</xref>
], Pfam [<xref ref-type="bibr" rid="pbio-0050016-b015">15</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b021">21</xref>
], and TIGRFAM [<xref ref-type="bibr" rid="pbio-0050016-b022">22</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b023">23</xref>
], or integrated efforts such as Uniprot [<xref ref-type="bibr" rid="pbio-0050016-b024">24</xref>
] and InterPro [<xref ref-type="bibr" rid="pbio-0050016-b025">25</xref>
], provide rich resources for protein annotation. However, a vast number of protein predictions remain unclassified both in terms of structure and function. Given varying rates of evolution, there is unlikely to be a single similarity threshold or even a small set of thresholds that can be used to define every protein family in nature. Consequently, estimates of the number of families that exist in nature vary considerably based on the different thresholds used and assumptions made in the classification process [<xref ref-type="bibr" rid="pbio-0050016-b026">26</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b029">29</xref>
].</p>
<p>In this study, we explored proteins using a comprehensive dataset of publicly available sequences together with environmental sequence data generated by the <italic>Sorcerer II</italic>
 Global Ocean Sampling (GOS) expedition [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
]. We used a novel clustering technique based on full-length sequence similarity both to predict proteins and to group related sequences. The goals were to understand the rate of discovery of protein families with the increasing number of protein predictions, explore novel families, and assess the impact of the environmental sequences from the expedition on known proteins and protein families. We used hidden Markov model (HMM) profiling to examine the relative biases in protein domain distributions in the GOS data and existing protein databases. This profiling was also used to assess the impact of the GOS data on target selection for protein structure characterization efforts. We carried out in-depth analyses on several protein families to validate our clustering approach and to understand the diversity and evolutionary information that the GOS data added; the families included ultraviolet (UV) irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO.</p>
</sec>
<sec id="s2"><title>Results/Discussion</title>
<sec id="s2a"><title>Data Generation, Sequence Clustering, and HMM Profiling</title>
<p>We used the following publicly available datasets in this study (<xref ref-type="table" rid="pbio-0050016-t001">Table 1</xref>
)—the National Center for Biotechnology Information (NCBI)'s nonredundant protein database (NCBI-nr) [<xref ref-type="bibr" rid="pbio-0050016-b031">31</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b032">32</xref>
], NCBI Prokaryotic Genomes (PG) [<xref ref-type="bibr" rid="pbio-0050016-b031">31</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b033">33</xref>
], TIGR Gene Indices (TGI-EST) [<xref ref-type="bibr" rid="pbio-0050016-b034">34</xref>
], and Ensembl (ENS) [<xref ref-type="bibr" rid="pbio-0050016-b035">35</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b036">36</xref>
]. The rationale for including these datasets is discussed in <xref ref-type="sec" rid="s3">Materials and Methods</xref>
. All datasets were downloaded on February 10, 2005.</p>
<p>None of the above-mentioned databases contained sequences from the Sargasso Sea study [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
], the largest environmental survey to date, and so we pooled reads from the Sargasso Sea study with the reads from the <italic>Sorcerer II</italic>
 GOS expedition [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
], creating a combined set that we call the GOS dataset. The GOS dataset was assembled using the Celera Assembler [<xref ref-type="bibr" rid="pbio-0050016-b037">37</xref>
] as described in [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
] (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). The GOS dataset was primarily generated from the 0.1 μm to 0.8 μm size filters and thus is expected to be mostly microbial [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
]. The data also included a small set of sequences from a viral size (<0.1 μm) fraction (<xref ref-type="table" rid="pbio-0050016-t001">Table 1</xref>
).</p>
<p>We identified open reading frames (ORFs) from the DNA sequences in the PG, TGI-EST, and GOS datasets. An ORF is commonly defined as a translated DNA sequence that begins with a start codon and ends with a stop codon. To accommodate partial DNA sequences, we extended this definition to allow an ORF to be bracketed by either a start codon or the start of the DNA sequence, and by either a stop codon or the end of the DNA sequence. ORFs were generated by considering translations of the DNA sequence in all six frames. For ORFs from the PG and TGI-EST datasets, we used the appropriate codon usage table for the known organism. For GOS ORFs from the assembled sequences, we used translation table 11 (the code for bacteria, archaea, and prokaryotic viruses) [<xref ref-type="bibr" rid="pbio-0050016-b031">31</xref>
]. We did not include alternate codon translations in this analysis. For all datasets, only ORFs containing at least 60 amino acids (aa) were considered. Not all ORFs are proteins. In this paper, ORFs that have reasonable evidence for being proteins are called <italic>predicted proteins;</italic>
 other ORFs are called <italic>spurious ORFs</italic>
.</p>
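<p>The extended ORF definition above can be illustrated with a short Python sketch. It is not the pipeline used in this study: the start-codon set (ATG, GTG, TTG), the function names, and the treatment of ambiguous bases are assumptions made for illustration, and input is assumed to be uppercase DNA. The codon table is the standard amino acid mapping, which translation table 11 shares; start codons are translated by table lookup rather than forced to methionine.</p>
<preformat>
```python
"""Six-frame ORF enumeration for partial DNA sequences (illustrative sketch)."""

BASES = "TCAG"
# Standard amino acid mapping in TCAG order; table 11 uses the same mapping.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common table 11 starts
MIN_AA = 60  # minimum ORF length stated in the text


def reverse_complement(dna):
    """Reverse-complement an uppercase DNA string (other symbols pass through)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(dna, min_aa=MIN_AA):
    """Return ORF translations for the reading frame starting at dna[0]."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    orfs = []
    segment = []         # codons since the last stop (or the sequence start)
    at_seq_start = True  # True until the first stop codon is seen
    for codon in codons + [None]:  # None marks the end of the sequence
        is_stop = codon is not None and CODON_TABLE.get(codon) == "*"
        if codon is not None and not is_stop:
            segment.append(codon)
            continue
        if not at_seq_start:
            # After a stop codon, the ORF must begin at the first start codon.
            first = next(
                (n for n, c in enumerate(segment) if c in START_CODONS), None
            )
            segment = segment[first:] if first is not None else []
        if len(segment) >= min_aa:
            # Codons with ambiguous bases are written as X.
            orfs.append("".join(CODON_TABLE.get(c, "X") for c in segment))
        segment, at_seq_start = [], False
    return orfs


def six_frame_orfs(dna, min_aa=MIN_AA):
    """Collect ORFs from three forward and three reverse-complement frames."""
    found = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            found.extend(orfs_in_frame(strand[offset:], min_aa))
    return found
```
</preformat>
<p>Because the sketch applies no criterion beyond length, it can also report out-of-frame open segments alongside the true coding frame, which is one reason not every ORF can be treated as a protein.</p>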
<p>In summary, the total input data for this study (<xref ref-type="table" rid="pbio-0050016-t001">Table 1</xref>
) consisted of 28,610,944 sequences from NCBI-nr, PG, TGI-EST, ENS, and GOS. All data and analysis results will be made publicly available (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
).</p>
<p>We used sequence similarity clustering to group related sequences and then predicted proteins from this grouping. We adopted this approach to protein prediction for two reasons. First, the GOS data make up a major portion of the dataset being analyzed, and a large fraction of GOS ORFs are fragmentary sequences. Traditional annotation pipelines and gene finders, which presume complete or near-complete genomic data, perform unsatisfactorily on this type of data. Second, protein prediction based on the comparison of ORFs to known protein sequences limits the protein families that can be explored. In particular, novel proteins that belong to known families will not be detected if they are sufficiently distant from known members of that family, even though other novel proteins may transitively link them to the known proteins. Similarly, truly novel protein families will not be detected.</p>
<p>As the primary input to our clustering process, we computed the pairwise sequence similarity of the 28.6 million aa sequences in our dataset using an all-against-all BLAST search [<xref ref-type="bibr" rid="pbio-0050016-b038">38</xref>
]. This required more than 1 million CPU hours on two large compute clusters (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). The sequences were clustered in four steps (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). In the first step, we identified a nonredundant set of sequences from the entire dataset using only pairwise matches with ≥98% similarity and involving ≥95% of the length of the shorter sequence. This step served the dual role of identifying highly conserved groups of sequences (where each group was represented by a <italic>nonredundant</italic>
 sequence) and removing redundancy in the dataset due to identical and near-identical sequences. Only nonredundant sequences were considered for further steps in our clustering procedure. In the second step, we identified <italic>core sets</italic>
 of similar sequences using only matches between two sequences involving ≥80% of the length of the longer sequence. We used a graph-theoretic procedure to identify dense subgraphs (the core sets) within a graph defined by these matches. While the match parameters we used in this step were more relaxed than those in the first step, we chose them to reduce the grouping of unrelated sequences while simultaneously reducing the unnecessary splitting of families. In the third step, these core sets were transformed into profiles, and we used a profile–profile method [<xref ref-type="bibr" rid="pbio-0050016-b039">39</xref>
] to merge related core sets into larger groups. In the final step, we recruited sequences to core sets using sequence-profile matching (PSI-BLAST [<xref ref-type="bibr" rid="pbio-0050016-b040">40</xref>
]) and BLAST matches to core set members. We required the match to involve ≥60% of the length of the sequence being recruited.</p>
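<p>The length-coverage criteria stated for the first, second, and final steps can be written as simple predicates over a pairwise match. The following Python sketch is only an illustration under stated assumptions: the <italic>Match</italic> record, its field names, and the definition of coverage as aligned length divided by sequence length are hypothetical, and any further criteria given in Materials and Methods are not encoded here.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A pairwise match between sequences a and b (hypothetical record)."""
    similarity: float  # fraction of similar aligned positions, 0 to 1
    aligned_len: int   # aligned residues, assumed equal on both sequences
    len_a: int         # length of sequence a in amino acids
    len_b: int         # length of sequence b in amino acids


def coverage(aligned_len, seq_len):
    """Fraction of a sequence covered by the match (assumed definition)."""
    return aligned_len / seq_len


def is_redundant_pair(m):
    """Step 1: at least 98% similarity over at least 95% of the shorter sequence."""
    shorter = min(m.len_a, m.len_b)
    return m.similarity >= 0.98 and coverage(m.aligned_len, shorter) >= 0.95


def is_core_set_edge(m):
    """Step 2: the match involves at least 80% of the longer sequence."""
    longer = max(m.len_a, m.len_b)
    return coverage(m.aligned_len, longer) >= 0.80


def can_recruit(m, recruited_len):
    """Final step: the match involves at least 60% of the recruited sequence."""
    return coverage(m.aligned_len, recruited_len) >= 0.60
```
</preformat>
<p>For example, a 96-residue match at 99% similarity between a 100-aa fragment and a 300-aa protein would mark the pair as redundant and would allow recruitment of the fragment, but it would not contribute an edge to the core-set graph, because it covers only about a third of the longer sequence.</p>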
<p>We identified and removed clusters containing likely spurious ORFs using two filters (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). The first filter identified clusters containing shadow ORFs. The second filter identified clusters containing conserved but noncoding sequences, as indicated by a lack of selection at the codon level. Only clusters that remained after the two filtering steps and contained at least two nonredundant sequences are reported in this analysis.</p>
<p>We examined the distribution of known protein domains in the full dataset using profile HMMs [<xref ref-type="bibr" rid="pbio-0050016-b041">41</xref>
] from the Pfam [<xref ref-type="bibr" rid="pbio-0050016-b015">15</xref>
] and TIGRFAM [<xref ref-type="bibr" rid="pbio-0050016-b022">22</xref>
] databases (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
).</p>
<p>We labeled sequences that end up in clusters (containing at least two nonredundant sequences) or that have HMM matches as <italic>predicted proteins.</italic>
 The inclusion of the PG ORF set allowed for the evaluation of protein prediction using our clustering approach. A comparison of proteins predicted in the PG ORF set by our clustering against PG ORFs annotated as proteins by whole-genome annotation techniques revealed that our protein prediction method via clustering has a sensitivity of 83% and a specificity of 86% (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). The HMM profiling allowed for the evaluation of our clustering technique's grouping of sequences. We used Pfam models in two different ways for this assessment (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
) and make three observations. First, using a simple Pfam domain architecture-based evaluation, these clusters are mostly consistent as reflected by 93% of clusters having less than 2% unrelated pairs of sequences in them. Second, these clusters are quite conservative and can split domain families, with 58% of domain architectures being confined to single clusters and 88% of domain architectures having more than half of their occurrences in a single cluster. Third, the size distribution of these clusters is quite similar to the size distribution of clusters induced by Pfams.</p>
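<p>The first observation, the share of unrelated sequence pairs within a cluster, can be computed as in the following Python sketch. The criterion used here is an assumption made for illustration and may differ from the evaluation described in Materials and Methods: two sequences are treated as unrelated when both carry Pfam annotations and share no domain, and sequences without annotations are ignored. The accession strings in the usage note are placeholders.</p>
<preformat>
```python
from itertools import combinations


def unrelated_pair_fraction(architectures):
    """Fraction of annotated sequence pairs that share no domain.

    `architectures` holds one tuple of domain accessions per sequence in
    the cluster; an empty tuple means the sequence has no domain match.
    """
    annotated = [frozenset(a) for a in architectures if a]
    pairs = list(combinations(annotated, 2))
    if not pairs:
        return 0.0  # nothing to compare, so no evidence of inconsistency
    unrelated = sum(1 for a, b in pairs if a.isdisjoint(b))
    return unrelated / len(pairs)


def is_consistent(architectures, max_unrelated=0.02):
    """True when fewer than 2% of annotated pairs are unrelated."""
    return max_unrelated > unrelated_pair_fraction(architectures)
```
</preformat>
<p>A cluster whose members have the architectures ("PF00069",), ("PF00069", "PF00433"), and ("PF00005",) has three annotated pairs, two of which share no domain, so its unrelated fraction is two-thirds and it would not count as consistent under this criterion.</p>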
</sec>
<sec id="s2b"><title>Protein Prediction</title>
<p>Of the initial 28,610,944 sequences, we labeled 9,978,637 sequences (35%) as predicted proteins based on the clustering, of which nearly 60% are from GOS (<xref ref-type="table" rid="pbio-0050016-t002">Table 2</xref>
). The HMM profiling labeled only an additional 226,743 (0.8%) sequences as predicted proteins, for a total of 10,205,380 predicted proteins. This indicates that our clustering method captures most of the sequences found by profile HMMs. For sequences both in clusters and with HMM matches, (on average) 73.5% of their length is covered by HMM matches. For sequences not in clusters but with HMM matches, this value is only 45.3%. Furthermore, while 64% of sequences in clusters have HMM matches, there are 3,550,901 sequences that are grouped into clusters but do not have HMM matches. Most of these clusters correspond either to families lacking profile HMMs or contain sequences that are too remote to match above the cutoffs used. The latter is an indication of the diversity added to known families that is not picked up by current profile HMMs.</p>
<p>Using our method, the predicted proteins constitute different fractions of the totals for the five datasets, with 87% for NCBI-nr, nearly 20% for both PG ORFs and TGI-EST ORFs, 92% for ENS, and 35% for GOS. The high rate of prediction for ENS is a reflection of the high degree of conservation of proteins across the metazoan genomes, whereas the prediction rates for PG ORFs and TGI-EST ORFs are similar to rates seen in other protein prediction approaches. The 13% of NCBI-nr sequences that we marked as spurious may constitute contaminants in the form of false predictions or organism-specific proteins. Nearly two-thirds of these sequences are labeled “hypotheticals,” “unnamed,” or “unknown.” This is more than twice the fraction of similarly labeled sequences (30%) in the full NCBI-nr dataset. Of the remaining one-third, half of them are less than 100 aa in length. This suggests that they are either fast-evolving short peptides, spurious predictions, or proteins that failed to meet the length-based thresholds in the clustering.</p>
<p>Based on the clustering and the HMM profiling, there is evidence for 6,123,395 proteins in the GOS dataset (<xref ref-type="table" rid="pbio-0050016-t002">Table 2</xref>
). Given the fragmentary nature of the GOS ORFs (as a result of the GOS assembly [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
]), it is not surprising that the average length of a GOS-predicted protein (199 aa) is shorter than the average length of predicted proteins in NCBI-nr (359 aa), PG ORFs (325 aa), TGI-EST ORFs (207 aa), and ENS (489 aa). The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (34%) than for PG ORFs (19%). This could be due to a large number of false-positive protein predictions in the GOS dataset. However, this is unlikely for several reasons. Nearly 4.64 million GOS ORFs (26.6%) have significant BLAST matches (with an <italic>E-</italic>
value ≤1 × 10<sup>−10</sup>
) to NCBI-nr sequences. The PG ORFs do not have a high false-positive rate compared to the submitted annotation for the prokaryotic genomes (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Most importantly, based on the fragmentary nature of GOS sequencing compared to PG sequencing, the number of shadow (spurious) ORFs ≥60 aa is significantly reduced (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
).</p>
<p>Some pairs of GOS-predicted proteins that belong to the same cluster are adjacent in the GOS assembly. While some of them correspond to tandem duplicate genes, an overwhelming fraction of the pairs are on mini-scaffolds [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
], indicating that they are potentially pieces of the same protein (from the same clone) that we split into fragments. We estimate that this effect applies to 3% of GOS-predicted proteins. Sequencing errors and the use of the wrong translation table can also result in the ORF generation process producing split ORF fragments.</p>
<p>The combined set of predicted proteins in NCBI-nr, PG, TGI-EST, and ENS is, as expected, highly redundant. For instance, most of the PG protein predictions are in NCBI-nr. Removing exact substrings of longer sequences (i.e., 100% identity) reduces this combined set to 3,167,979 predicted proteins. When we perform the same filtering on the GOS dataset, 5,654,638 predicted proteins remain. Thus, the GOS-predicted protein set is 1.8 times the size of the predicted protein set from current publicly available datasets. We used a simple BLAST-based scheme to assign kingdoms for the GOS sequences (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Of the sequences that we could annotate by kingdom, 63% of the sequences in the public datasets are from the eukaryotic kingdom, and 90.8% of the sequences in the GOS set are from the bacterial kingdom (<xref ref-type="fig" rid="pbio-0050016-g001">Figure 1</xref>
).</p>
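<p>The substring-based redundancy filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name remove_contained and the toy sequences are hypothetical, and the quadratic scan shown here would need an indexed approach at the scale of millions of proteins. The final line simply recomputes the size ratio of the two filtered sets from the counts reported in the text.</p>
<preformat>
```python
def remove_contained(sequences):
    """Drop exact duplicates and sequences that are exact substrings of a longer kept sequence."""
    kept = []
    # Process longest first, so any containing sequence is already in `kept`.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


# Toy input (hypothetical sequences, for illustration only).
toy = ["MKTAYIAKQR", "TAYIAK", "MKTAYIAKQR", "GSHMLEDP", "HMLE", "AAAA"]
nonredundant = remove_contained(toy)

# Size ratio of the filtered GOS set to the filtered public set (counts from the text).
gos_to_public = 5_654_638 / 3_167_979
```
</preformat>
<p>On the toy input, the duplicate and the two contained fragments are removed, and the ratio rounds to the 1.8 quoted above.</p>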
</sec>
<sec id="s2c"><title>Protein Clustering</title>
<p>The 9,978,637 protein sequences predicted by our clustering method are grouped into 297,254 clusters of size two or more, where the <italic>size</italic>
 of a cluster is defined to be the number of nonredundant sequences in the cluster. There are 280,187 small clusters (size &lt; 20), 12,992 medium clusters (size between 20 and 200), and 4,075 large clusters (size > 200). While the 17,067 medium- and large-sized clusters constitute only 6% of the total number of clusters, they account for 85% of all the sequences that are clustered (<xref ref-type="table" rid="pbio-0050016-t003">Table 3</xref>
). Many of the largest clusters correspond to families that have functionally diversified and expanded (<xref ref-type="table" rid="pbio-0050016-t004">Table 4</xref>
). While some large families, such as the HIV envelope glycoprotein family and the immunoglobulins, also reflect biases in sequence databases, many more, including ABC transporters, kinases, and short-chain dehydrogenases, reflect their expected abundance in nature.</p>
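<p>The size classes and totals above can be checked with a short sketch. The thresholds follow the text, with the assumption that "between 20 and 200" is inclusive at both ends; the function name size_class and the toy sizes are hypothetical.</p>
<preformat>
```python
from collections import Counter


def size_class(size):
    """Classify a cluster by its number of nonredundant sequences."""
    if size > 200:
        return "large"
    if size >= 20:  # assumption: the medium class includes sizes 20 and 200
        return "medium"
    return "small"


# Cluster counts reported in the text.
reported = {"small": 280_187, "medium": 12_992, "large": 4_075}
total_clusters = sum(reported.values())
medium_and_large = reported["medium"] + reported["large"]
share_of_clusters = medium_and_large / total_clusters

# Toy cluster sizes (hypothetical) binned with the same thresholds.
toy_counts = Counter(size_class(s) for s in [2, 19, 20, 200, 201, 5000])
```
</preformat>
<p>The reported class counts sum to the 297,254 clusters, and the medium and large classes together make up roughly 6% of them, in agreement with the figures quoted above.</p>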
</sec>
<sec id="s2d"><title>Rate of Discovery of Protein Families</title>
<p>We examined the rate of discovery of protein families using our clustering method to determine whether our sampling of the protein universe is reaching saturation. We find that for the present number of sequences there is an approximately linear trend in the rate of discovery of clusters with the addition of new (i.e., nonredundant) sequences (<xref ref-type="fig" rid="pbio-0050016-g002">Figure 2</xref>
). Moreover, the observed distribution of cluster sizes is well approximated by a power law [<xref ref-type="bibr" rid="pbio-0050016-b042">42</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b043">43</xref>
], and this observed power law can be used to predict the rate of growth of the number of clusters of a given size (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). This rate is dependent on the value of the power law exponent and decreases with increasing cluster sizes. We find good agreement between the observed and predicted growth rates for different cluster sizes. The approximately linear relationship between the number of clusters and the number of protein sequences indicates that there are likely many more protein families (either novel or subfamilies distantly related to known families) remaining to be discovered.</p>
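<p>As an illustration of how a power law exponent might be estimated from a cluster size histogram, the sketch below fits a straight line to log(count) versus log(size) by ordinary least squares. This is an assumption made for illustration only: the fitting procedure actually used is described in Materials and Methods, maximum-likelihood estimators are generally preferred for real data, and the function name fit_power_law and the synthetic histogram are hypothetical.</p>
<preformat>
```python
import math


def fit_power_law(size_to_count):
    """Least-squares fit of log(count) = log(c) - alpha * log(size); returns (alpha, c)."""
    xs = [math.log(s) for s in size_to_count]
    ys = [math.log(n) for n in size_to_count.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic histogram (hypothetical): counts follow 1e6 * size**-2 exactly.
synthetic = {s: 1e6 * s ** -2.0 for s in (2, 4, 8, 16, 32, 64, 128)}
alpha, c = fit_power_law(synthetic)
```
</preformat>
<p>On this noise-free synthetic input the fit recovers the exponent and prefactor used to generate it; real cluster size data would scatter around the fitted line.</p>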
</sec>
<sec id="s2e"><title>GOS versus Known Prokaryotic versus Known Nonprokaryotic</title>
<p>We also examined the GOS coverage of known proteins and protein families. Based on the cell-size filtering performed while collecting the GOS samples, we expected that the sample would predominantly be a size-limited subset of prokaryotic organisms [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
]. We studied the content of the 17,067 medium- and large-sized clusters across three groupings: (1) GOS, (2) known prokaryotic (PG together with bacterial and archaeal portions of NCBI-nr), and (3) known nonprokaryotic (TGI-EST and ENS together with viral and eukaryotic portions of NCBI-nr). The Venn diagram in <xref ref-type="fig" rid="pbio-0050016-g003">Figure 3</xref>
 shows the breakdown of these clusters by content (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). The largest section contains GOS-only clusters (23.40%), emphasizing the significant novelty provided by the GOS data. The next-largest section consists of clusters containing sequences from only the known nonprokaryotic grouping (20.78%), followed closely by the section containing clusters with sequences from all three groupings (20.23%). The large known nonprokaryotic–only grouping shows that our current GOS sampling methodology will not cover all protein families, and perhaps misses some protein families that are exclusive to higher eukaryotes. The large section of clusters that include all three groupings indicates a large core of well-conserved protein families across all domains of life. In contrast, the known prokaryotic protein families are almost entirely covered by the GOS data.</p>
</sec>
<sec id="s2f"><title>Novelty Added by GOS Data</title>
<p>There are 3,995 medium and large clusters that contain only sequences from the GOS dataset. Some are divergent members of known families that failed to be merged by the clustering parameters used, or are too divergent to be detected by any current homology detection methods. The remaining clusters are completely novel families. Of the 3,995 GOS-only clusters, 44.9% contain sequences that have HMM matches or BLAST matches to sequences in a more recent snapshot of NCBI-nr (downloaded in August 2005) than was used in this study. The recent NCBI-nr matches include phage sequences from cyanophages (P-SSM2 and P-SSM4) [<xref ref-type="bibr" rid="pbio-0050016-b044">44</xref>
] and sequences from the SAR-11 genome (<italic>Candidatus</italic> Pelagibacter ubique
 HTCC1062) [<xref ref-type="bibr" rid="pbio-0050016-b045">45</xref>
]. We used profile–profile searches [<xref ref-type="bibr" rid="pbio-0050016-b039">39</xref>
] to show that an additional 12.5% of the GOS-only clusters can be linked to profiles built from Protein Data Bank (PDB), COG, or Pfam. The 2,295 clusters with detected homology are referred to as Group I clusters. The remaining 1,700 (42.6%) GOS-only clusters with no detectable homology to known families are labeled as Group II clusters.</p>
<p>We applied a guilt-by-association operon method to annotate the GOS-only clusters with a strategy that did not rely on direct sequence homology to known families. Function was inferred for the GOS-only clusters by examining their same-strand neighbors on the assembly (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Similar strategies have been successfully used to infer protein function in finished microbial genomes [<xref ref-type="bibr" rid="pbio-0050016-b046">46</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b048">48</xref>
]. Despite minimal assembly of GOS reads, many scaffolds and mini-scaffolds contain at least partial fragments of more than one predicted ORF, thereby making this approach feasible. For 90 (5.3%) of the Group II clusters, and for 214 (9.3%) of the Group I clusters, at least one Gene Ontology (GO) [<xref ref-type="bibr" rid="pbio-0050016-b049">49</xref>
] biological process term at <italic>p</italic>
-value ≤0.05 can be inferred. The inferred functions and neighbors of some of these GOS-only clusters are highlighted in <xref ref-type="table" rid="pbio-0050016-t005">Table 5</xref>
. We observed that for Group I clusters, the neighbor-inferred function is often bolstered by some information from weak homology to known sequences. While neighboring clusters as a whole are of diverse function, a number of GOS-only clusters seem to be next to clusters implicated in photosynthesis or electron transport. These GOS-only clusters could be of viral origin, as cyanophage genomes contain and express some photosynthetic genes that appear to be derived from their hosts [<xref ref-type="bibr" rid="pbio-0050016-b044">44</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b050">50</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b051">51</xref>
]. In support of these observations, we identified five photosynthesis-related clusters containing hundreds to thousands of viral sequences, including psbA, psbD, petE, SpeD, and hli in the GOS data; furthermore, our nearest-neighbor analysis of these sequences reveals the presence of multiple viral proteins (unpublished data).</p>
<p>Although the majority of GOS-only sequences are bacterial, a higher than expected proportion of the GOS-only clusters are predicted to be of viral origin, implying that viral sequences and families are poorly explored relative to other microbes. To assign a kingdom to the GOS-only clusters, we first inferred the kingdom of neighboring sequences based on the taxonomy of the top four BLAST matches to the NCBI-nr database (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). A possible kingdom was assigned to the GOS-only cluster if more than 50% of assignable neighboring sequences belong to the same kingdom. Viewed in this way, 11.8% of Group I clusters and 17.3% of Group II clusters with at least one kingdom-assigned neighbor have more than 50% viral neighbors (<xref ref-type="fig" rid="pbio-0050016-g004">Figure 4</xref>
). Only 3.3% and 3.4% of random samples of clusters with size distributions matching those of Group I and Group II clusters, respectively, have more than 50% viral neighbors, while 7.7% of all clusters pass this criterion. A total of 547 GOS-only clusters contain sequences collected from the viral size fraction included in the GOS dataset. For these clusters, 38.9% of the Group I subset and 27.5% of the Group II subset with one or more kingdom-assigned neighbors would be inferred as viral, based on the conservative criterion of having more than 50% viral assignable neighbors. Several alternative kingdom assignment methods were tried (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
) and support a similar conclusion.</p>
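<p>The majority rule used to assign a possible kingdom to a GOS-only cluster can be sketched as follows. This is a simplified illustration: it assumes that each neighboring sequence has already been given a kingdom label (or None when no kingdom could be inferred), and the function name assign_kingdom and the example labels are hypothetical.</p>
<preformat>
```python
from collections import Counter


def assign_kingdom(neighbor_kingdoms):
    """Return the kingdom held by more than 50% of assignable neighbors, else None.

    `neighbor_kingdoms` holds one label per neighboring sequence; None marks a
    neighbor whose kingdom could not be inferred.
    """
    assignable = [k for k in neighbor_kingdoms if k is not None]
    if not assignable:
        return None
    kingdom, votes = Counter(assignable).most_common(1)[0]
    # Strict majority: exactly 50% is not enough.
    return kingdom if 2 * votes > len(assignable) else None


viral_case = assign_kingdom(["Viruses", "Viruses", "Bacteria", None])
tied_case = assign_kingdom(["Viruses", "Bacteria"])
empty_case = assign_kingdom([None, None])
```
</preformat>
<p>Unassignable neighbors are excluded from the denominator, so two viral neighbors out of three assignable ones yield a viral call, whereas an even split or a cluster without assignable neighbors yields no call.</p>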
<p>The GOS-only clusters also tend to be more AT-rich than sequences from a random size-matched sample of clusters (35.9% ± 8% GC content for Group II clusters versus 49.5% ± 11% GC content for the sample). Phage genomes with a <italic>Prochlorococcus</italic>
 host [<xref ref-type="bibr" rid="pbio-0050016-b044">44</xref>
] are also AT-rich (37% average GC content). Our analysis of the graph constructed based on inferred operon linkages between all clusters indicates that the GOS-only clusters may constitute large sets of cotranscribed genes (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
).</p>
<p>The high proportion of potentially viral novel clusters observed here is reasonable, as 60%–80% of the ORFs in most finished marine phage genomes are not homologous to known protein sequences [<xref ref-type="bibr" rid="pbio-0050016-b052">52</xref>
]. Viral metagenomics projects have reported an equally high fraction of novel ORFs [<xref ref-type="bibr" rid="pbio-0050016-b053">53</xref>
], and a recent marine metagenomics project estimated that up to 21% of photic zone sequences could be of viral origin [<xref ref-type="bibr" rid="pbio-0050016-b051">51</xref>
]. It has also been reported that 40% of ORFans (sequences that lack similarity to known proteins and predicted proteins) exist in close spatial proximity to each other in bacterial genomes, and this combined with proximity to integration signals has been used to suggest a viral horizontally transferred origin for many bacterial ORFans [<xref ref-type="bibr" rid="pbio-0050016-b054">54</xref>
]. Others have noted a clustering of ORFans in genome islands and suggested they derive from a phage-related gene pool [<xref ref-type="bibr" rid="pbio-0050016-b055">55</xref>
]. A recent analysis of genome islands from related <italic>Prochlorococcus</italic>
 found that phage-like genes and novel genes cohabit these dynamic areas of the genome [<xref ref-type="bibr" rid="pbio-0050016-b056">56</xref>
]. In our GOS-only clusters, 37 of the 1,700 clusters with no detectable similarity (2.2%) have at least ten bacterial-classified and ten viral-classified neighboring ORFs. This is 6.2-fold higher than the rate seen for the size-matched sample of all clusters (six clusters, 0.35%). This lends further support to a phage origin for at least some ORFans found in bacterial genomes.</p>
<p>If a sizable portion of the novel families in the GOS data are in fact of viral origin, it suggests that we are far from fully exploring the molecular diversity of viruses, a conclusion echoed in previous studies of viral metagenomes [<xref ref-type="bibr" rid="pbio-0050016-b053">53</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b057">57</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b058">58</xref>
]. In studies of bacterial genomes, discovery of new ORFans shows no sign of reaching saturation [<xref ref-type="bibr" rid="pbio-0050016-b059">59</xref>
]. Coverage of many phage families in the GOS data may be low, given that there are inherent differences in the abundance of their presumed bacterial hosts. These GOS-only clusters were operationally defined as having at least 20 nonredundant sequences. Reducing this threshold to ten nonredundant sequences adds 7,241 clusters. Whether this vast diversity represents new families or reflects an inability to detect distant homology will require structural and biochemical studies, as well as continued development of computational methods to identify remotely related sequences.</p>
</sec>
<sec id="s2g"><title>Comparison of Domain Profiles in GOS and PG Datasets</title>
<p>We used HMM profiling to address the question of which biochemical and biological functions are expanded or contracted in GOS compared to the largely terrestrial genomes in PG. Significant differences are seen in 68% of domains (4,722 out of the 6,975 domains that match either GOS or PG; <italic>p</italic>
-value &lt;0.001, chi-square test). These differences reflect several factors, including differing biochemical needs of oceanic life and taxonomic biases in the two datasets. An initial comparison of these domain profiles helps shed light on these factors. 91% (964/1,056) of GOS-only domains are viral and/or eukaryotic specific (by Pfam annotation). Most of the remaining 92 domains are rare (63 domains have fewer than ten copies in GOS), are predominantly eukaryotic/viral, or are specific to narrow bacterial taxa without completed genome sequences. Most of the 879 PG-only domains are also rare (444 have ten or fewer members), and/or are restricted to tight lineages, such as <italic>Mycoplasma</italic>
 (104 matches to five domains) or largely extremophile archaeal-specific domains (1,254 matches to 99 domains). Highly PG-enriched domains also tend to belong in these categories. Many moderately skewed domains reflect the taxonomic skew between PG and GOS. For instance, we found that a set of six sarcosine oxidase-related domains is 4.8-fold enriched in GOS (<xref ref-type="table" rid="pbio-0050016-t006">Table 6</xref>
). They are mostly found in α- and γ-proteobacteria, which are widespread in GOS. Normalizing to the taxonomic class level predicts a 1.8-fold enrichment in GOS, indicating that taxonomy alone cannot fully explain the prevalence of these proteins in oceanic bacteria.</p>
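<p>A per-domain chi-square comparison of the kind described above can be sketched as follows. The contingency layout is an assumption, since it is not spelled out in this section: each domain is tested in a 2 × 2 table of matches to that domain versus matches to all other domains, in GOS and in PG. The function name chi_square_2x2 and the counts are hypothetical.</p>
<preformat>
```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction).

    Table layout (an assumption): rows are datasets, columns are
    "matches to this domain" and "matches to all other domains".
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 480 of 100,000 GOS matches vs 100 of 100,000 PG matches.
stat, p = chi_square_2x2(480, 99_520, 100, 99_900)
significant = 0.001 > p

# Identical proportions in both datasets give a statistic of zero.
null_stat, null_p = chi_square_2x2(10, 90, 10, 90)
```
</preformat>
<p>With thousands of domains tested, a multiple-testing correction would normally be considered as well; the sketch leaves that step out.</p>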
</sec>
<sec id="s2h"><title>Mysterious Lack of Characteristic Gram-Positive Domains</title>
<p>Gram-positive bacteria <italic>(Firmicutes</italic>
 and <italic>Actinobacteria)</italic>
 represent 26.7% of PG and ~12% of GOS [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
]. Given the larger size of the GOS dataset, one might predict Gram-positive–specific domains to be ~2.4-fold enriched in GOS. Instead, the opposite is consistently seen. Of 15 firmicute-specific spore-associated domains, PG has 503 members, but GOS has none. For another 22 firmicute-restricted domains of varying or unknown function, the PG/GOS ratio is 1,797:77 (<xref ref-type="table" rid="pbio-0050016-t006">Table 6</xref>
). Hence, it appears that GOS Gram-positive lineages lack most of their characteristic protein domains. Two sequenced marine Gram-positives (<named-content content-type="genus-species">Oceanobacillus iheyensis</named-content>
 [<xref ref-type="bibr" rid="pbio-0050016-b060">60</xref>
] and <named-content content-type="genus-species">Bacillus sp</named-content>
. NRRL B-14911) have a large complement of these domains. However, another recently assembled genome from Sargasso Sea surface waters, the actinomycete <named-content content-type="genus-species">Janibacter sp</named-content>
. HTCC2649, has just two of these domains, and may reveal a whole-genome context for this curious loss of characteristic domains.</p>
</sec>
<sec id="s2i"><title>Flagella and Pili Are Selectively Lost from Oceanic Species</title>
<p>Flagellum components from both eubacteria and archaea are significantly underrepresented in the GOS dataset by about 2-fold (<xref ref-type="table" rid="pbio-0050016-t006">Table 6</xref>
). Ironically, at a bacterial scale, swimming may be worthwhile on an almost dry surface, but not in open water. The chemotaxis (che) operon that often directs flagellar activity is also rare in GOS. Another directional appendage, the pilus, is even more reduced, though its taxonomic distribution (mostly in proteobacteria, predominantly γ-proteobacteria) would have predicted enrichment.</p>
</sec>
<sec id="s2j"><title>Skew in Core Cellular Pathways</title>
<p>While taxonomically specialized domains are likely to be skewed by taxonomic differences, core pathways found in many or all organisms paint a different picture. We used GO term mapping and text mining to group domains into major functions and to look for consistent skews across several domains. Several core functions, including DNA-associated proteins (DNA polymerase, gyrase, topoisomerase), ribosomal subunits shared by all three kingdoms, marker proteins such as recA and dnaJ, and TCA cycle enzymes all tend to be GOS enriched. This suggests that oceanic genomes may be more compact than sequenced genomes and so have a higher proportion of core pathways.</p>
</sec>
<sec id="s2k"><title>Characteristics and Kingdom Distribution of Known Protein Domains</title>
<p>A decade ago, databases were highly biased towards proteins of known function. Today, whole-genome sequencing and structural genomics efforts have presumably reduced the biases that are a result of targeted protein sequencing. We used the Pfam database to compare the characteristics and kingdom distribution of known protein domains in the GOS dataset to that of proteins in the publicly available datasets (NCBI-nr, PG, TGI-EST, and ENS). Such an effort can be used to assess biases in these datasets, help direct future sampling efforts (of underrepresented organisms, proteins, and protein families), make more informed generalizations about the protein universe, and provide important context for determination of protein evolutionary relationships (as biased sampling could indicate expected but missing sequences).</p>
<p>For this analysis we used the nonredundant datasets (at 100% identity) discussed in <xref ref-type="fig" rid="pbio-0050016-g001">Figure 1</xref>
. We refer to the set of 3,167,979 nonredundant sequences from NCBI-nr, PG, TGI-EST, and ENS as the public-100 set and the similarly filtered set of 5,654,638 sequences from the GOS data as the GOS-100 set.
</p>
<p>About 70% of public-100 sequences and 56% of GOS-100 sequences significantly match at least one Pfam model. The most obvious difference between the sets is that the vast majority of GOS sequences are bacterial, and this has to be taken into account when comparing the numbers. Since different Pfam families appear with different frequencies in the kingdoms, we considered the results for each kingdom separately (<xref ref-type="fig" rid="pbio-0050016-g005">Figure 5</xref>
). We then evaluated all kingdoms together, with results normalized by relative abundance of members from the different kingdoms. A domain found commonly and exclusively in eukaryotes and abundant in public-100 would be expected to be found rarely in GOS-100. We used a conservative BLAST-based kingdom assignment method to assign kingdoms to the GOS sequences (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
).</p>
<p>In each kingdom, sequences in GOS-100 are less likely to match a Pfam family than those in public-100 (<xref ref-type="fig" rid="pbio-0050016-g005">Figure 5</xref>
). For the cellular kingdoms, these differences are comparatively modest. While diversity of the GOS data accounts for some of this difference, it might also be explained in part by the fragmentary nature of the GOS sequences. Viruses tell a dramatic and different story. Of public-100 viral sequences, 89.1% match a Pfam domain, while only 27.5% of GOS-100 viral sequences have a match. This tremendous difference appears to be due to heavy enrichment of the public data for minor variants of a few protein families, indicated by the sizes of the ten most populous Pfams in each kingdom (<xref ref-type="fig" rid="pbio-0050016-g005">Figure 5</xref>
). Sequences from three Pfam families (envelope glycoprotein GP120, reverse transcriptase, and retroviral aspartyl protease) account for a third of all public viral sequences. By contrast, the most populous three families in the GOS-100 data (bacteriophage T4-like capsid assembly protein [Gp20], major capsid protein Gp23, and phage tail sheath protein) account for only about 7% of GOS-100 viral sequences. Such a difference may be due to intentional oversampling of proteins that come from disease-causing organisms in the public dataset.</p>
<p>While the total proportion of proteins with a Pfam hit is fairly similar between public-100 (70%) and GOS-100 (56%) datasets, there are considerable differences with regard to the distributions of protein families within these two datasets. The most highly represented Pfam families in GOS-100 compared to public-100 are shown in <xref ref-type="table" rid="pbio-0050016-t007">Table 7</xref>
. Notably, we found that while many known viral families are absent in GOS-100, viral protein families dominate the list of the families more highly represented in GOS-100; this is presumably because of biases in the collection of previously known viral sequences. Surprisingly few bacterial families were among the most represented in GOS-100 compared with public-100. By contrast, we also observed that those families found more rarely in GOS-100 than public-100 were frequently bacterial (<xref ref-type="table" rid="pbio-0050016-t007">Table 7</xref>
). This appears to be a result of the large number of key bacterial and viral pathogen proteins in public-100 that are comparatively less abundant in the oceanic samples and/or less intensively sampled.</p>
</sec>
<sec id="s2l"><title>GOS-100 Data Suggest That a Number of “Kingdom-Specific” Pfams Actually Are Represented in Multiple Kingdoms</title>
<p>Of the 7,868 Pfam models in Pfam 17.0, 4,050 match proteins from only a single kingdom in public-100. The additional sequences from GOS-100 reveal that some of these families actually have representatives in multiple kingdoms. <xref ref-type="table" rid="pbio-0050016-t008">Table 8</xref>
 shows 12 families that have a Pfam match to at least one GOS-100 protein with an <italic>E-</italic>
value ≤ 1 × 10<sup>−10</sup>
, and which we confidently assigned to a kingdom different from that of all the public-100 matches. Because our criteria for a “confident” kingdom assignment are conservative, there are only one or a few confident assignments for each Pfam domain to a “new” kingdom. Our “confident” criteria are especially difficult to meet in the case of kingdom-crossing, due to the votes contributed by the crossing protein (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Thus, many scaffolds have no confident kingdom assignment. Our examination of each of the scaffolds responsible for a determination of kingdom-crossing confirms that each one had both a highly significant match to the Pfam model in question and an overwhelming number of votes for the unexpected kingdom. These scaffold assemblies were also manually inspected. No clear anomalies were observed. In most instances, the assemblies in question were composed of a single unitig, and as such are high-confidence assemblies. Mate pair coverage and consistent depth of coverage provide further support for the correctness of those assemblies that are built from multiple unitigs. Examples of kingdom-crossing families include indoleamine 2,3-dioxygenase (IDO), MAM domain, and MYND finger [<xref ref-type="bibr" rid="pbio-0050016-b015">15</xref>
], which had previously been seen only in eukaryotes but which we also find in bacteria. These Pfams now cross kingdoms, due either to their being more ancient than previously realized or to lateral transfer.</p>
<p>We explored the IDO family further. This family has representatives in vertebrates, invertebrates, and multiple fungal lineages [<xref ref-type="bibr" rid="pbio-0050016-b015">15</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b061">61</xref>
] in public-100. Members of the IDO family are heme-binding, and mammalian IDOs catalyze the rate-limiting step in the catabolic breakdown of tryptophan [<xref ref-type="bibr" rid="pbio-0050016-b062">62</xref>
], while family members in mollusks have a myoglobin function [<xref ref-type="bibr" rid="pbio-0050016-b063">63</xref>
]. In mammals, IDO also appears to have a role in the immune system [<xref ref-type="bibr" rid="pbio-0050016-b062">62</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b064">64</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b066">66</xref>
]. The IDO Pfam has matches to 66 proteins in public-100, all of which are eukaryotic. However, it also has matches to ten GOS-100 sequences that we confidently labeled as bacterial proteins and matches to 206 GOS-100 sequences for which a confident kingdom assignment could not be made (many of these are likely bacterial sequences due to the GOS sampling bias). To reconstruct a phylogeny of the IDO family, we searched a recent version of NCBI-nr (March 5, 2006) for IDO proteins that were not included in the public-100 dataset. The search identified two bacterial proteins from the whole genomes of the marine bacteria <named-content content-type="genus-species">Erythrobacter litoralis</named-content>
 and <italic>Nitrosococcus oceani,</italic>
 and 24 eukaryotic proteins (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). The phylogeny shown in <xref ref-type="fig" rid="pbio-0050016-g006">Figure 6</xref>
 shows 54% bootstrap support for a separation of the clade containing exclusively public-100 and NCBI-nr 2006 eukaryotic sequences from a clade with the GOS-100 sequences as well as the two NCBI-nr <named-content content-type="genus-species">E. litoralis</named-content>
 and <named-content content-type="genus-species">N. oceani</named-content>
 sequences. We confirmed this feature of the tree topology with multiple other phylogeny reconstruction methods. Curiously, there is considerable intermixing of bacterial and eukaryotic sequences in the clade of GOS-100 sequences and the two NCBI-nr bacteria. A manual inspection of the scaffolds that contain the ten GOS-100 sequences containing the IDO domain that we confidently labeled as bacterial overwhelmingly supports the kingdom assignment. However, a manual inspection of the scaffolds that contain the ten GOS-100 sequences containing the IDO domain that we confidently labeled as eukaryotic presents a less convincing picture. These scaffolds are short, with most of them containing only two voting ORFs. Since the NCBI-nr version used in the public-100 set has IDO from eukaryotes only, the ORF with the IDO domain itself would cast four votes for eukaryotes. Thus, these GOS-100 eukaryotic labelings are not nearly as confident as the bacterial ones.</p>
</sec>
<sec id="s2m"><title>Structural Genomics Implications</title>
<p>Knowledge about global protein distributions can be used to inform priorities in related fields such as structural genomics. Structural genomics is an international effort to determine the 3-D shapes of all important biological macromolecules, with a primary focus on proteins [<xref ref-type="bibr" rid="pbio-0050016-b067">67</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b072">72</xref>
]. Previous studies have shown that an efficient strategy for covering the protein structure universe is to choose protein targets for experimental structure characterization from among the largest families with unknown structure [<xref ref-type="bibr" rid="pbio-0050016-b073">73</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b074">74</xref>
]. If the structure of one family member is determined, it may be used to accurately infer the fold of other family members, even if the sequence similarity between family members is too low to enable accurate structural modeling [<xref ref-type="bibr" rid="pbio-0050016-b075">75</xref>
]. Therefore, large families are a focus of the production phase of the Protein Structure Initiative (PSI), the National Institutes of Health–funded structural genomics project that commenced in October 2005 [<xref ref-type="bibr" rid="pbio-0050016-b076">76</xref>
].</p>
<p>In March 2005, 2,729 (36%) of 7,677 Pfam families had at least one member of known structure; these families could be used to infer folds for approximately 51% of all pre-GOS prokaryotic proteins (covering 44% of residues) [<xref ref-type="bibr" rid="pbio-0050016-b074">74</xref>
]. The Pfam5000 strategy is to solve one structure from each of the largest remaining families, until a total of 5,000 families have at least one member with known structure [<xref ref-type="bibr" rid="pbio-0050016-b073">73</xref>
]. As this strategy is similar to that being used at PSI centers to choose targets, projections based on the Pfam5000 should reflect PSI results. Completion of the Pfam5000, a tractable goal within the production phase of PSI, would enable accurate fold assignment for approximately 65% of all pre-GOS prokaryotic proteins. In the GOS-100 dataset, we observed that 46% of the proteins might currently be assigned a fold based on Pfam families of known structure (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Completion of the Pfam5000 would increase this coverage to 55%.</p>
<p>The GOS sequences will affect Pfam in two ways: some will be classified in existing protein families, thus increasing the size of these families; others may eventually be classified into new GOS-specific families. Both of these will alter the relative sizes of different families, and thus their prioritization for structural genomics studies. We calculated the sizes for all Pfam families based on the number of occurrences of each family in the public-100 dataset. Proteins in GOS-100 were then added and the family sizes were recalculated. A total of 190 families that are not in the Pfam5000 based on public-100 are moved into the Pfam5000 after addition of the GOS data. The 30 largest such families are shown in <xref ref-type="table" rid="pbio-0050016-t009">Table 9</xref>
. As 20 of the 30 families are annotated as domains of unknown function in Pfam, structural characterization might be helpful in identifying their cellular or molecular functions. Reshuffling the Pfam5000 to prioritize these 190 families would improve structural coverage of GOS sequences after completion of the Pfam5000 by almost 1 percentage point relative to the original Pfam5000 (from 55.4% to 56.1%), with only a small decrease in coverage of public-100 sequences (from 67.7% to 67.5%).</p>
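<p>The reprioritization described above amounts to re-ranking families by size after the GOS-100 counts are added and comparing the top-ranked sets. The sketch below shows that idea on toy data. It is a simplification: it ranks purely by family size and ignores the fact that the Pfam5000 also retains families that already have a member of known structure. All family names, counts, and function names are hypothetical.</p>
<preformat>
```python
def largest_families(sizes, n):
    """Return the set of the n largest families (ties broken by family name)."""
    ranked = sorted(sizes, key=lambda fam: (-sizes[fam], fam))
    return set(ranked[:n])


def add_counts(base, extra):
    """Sum per-family member counts from two datasets."""
    merged = dict(base)
    for fam, count in extra.items():
        merged[fam] = merged.get(fam, 0) + count
    return merged


# Hypothetical family sizes; the analysis in the text ranks Pfam families with n = 5000.
public_sizes = {"PF_A": 900, "PF_B": 700, "PF_C": 300, "DUF_X": 50, "DUF_Y": 40}
gos_sizes = {"PF_A": 100, "DUF_X": 800, "DUF_Y": 10, "PF_C": 20}

before = largest_families(public_sizes, 3)
after = largest_families(add_counts(public_sizes, gos_sizes), 3)
promoted = after - before   # families that enter the top set once GOS is added
displaced = before - after  # families that drop out as a result
```
</preformat>
<p>In the toy data, one family of unknown function that is small in the public set but abundant in GOS enters the top set and displaces a previously selected family, which mirrors on a small scale the movement of 190 families reported above.</p>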
<p>The Pfam5000 would be further reprioritized by the classification of clusters of GOS sequences into Pfam. Assuming each cluster of pooled GOS-100 and public-100 sequences without a current Pfam match would be classified as a single Pfam family, 885 such families would replace existing families in the Pfam5000. These 885 clusters contain a total of 383,019 proteins in GOS-100 and public-100. The reprioritized Pfam5000 would also retain 1,183 families of unknown structure from the current Pfam5000; these families comprise a total of 1,040,330 proteins in GOS-100 and public-100.</p>
</sec>
<sec id="s2n"><title>Known Protein Families and Increased Diversity Due to GOS Data</title>
<p>Several protein families serve as examples to further highlight the diversity added by the GOS dataset. In this paper, we examined UV irradiation DNA damage repair enzymes, phosphatases, proteases, and the metabolic enzymes glutamine synthetase and RuBisCO (<xref ref-type="table" rid="pbio-0050016-t010">Table 10</xref>
). The RecA family (unpublished data) and the kinase family [<xref ref-type="bibr" rid="pbio-0050016-b077">77</xref>
] have also been explored in the context of the GOS data. There are more than 5,000 RecA and RecA-like sequences in the GOS dataset (<xref ref-type="table" rid="pbio-0050016-t010">Table 10</xref>
). An analysis of the RecA phylogeny including the GOS data reveals several completely new RecA subfamilies. A detailed study of kinases in the GOS dataset demonstrated the power of additional sequence diversity in defining and exploring protein families [<xref ref-type="bibr" rid="pbio-0050016-b077">77</xref>
]. The discovery of 16,248 GOS protein kinase–like enzymes enabled the definition and analysis of 20 distinct kinase-like families. The diverse sequences allowed the definition of key residues for each family, revealing novel core motifs within the entire superfamily, and predicted structural adaptations in individual families. This data enabled the fusion of choline and aminoglycoside kinases into a single family, whose sequence diversity is now seen to be at least as great as the eukaryotic protein kinases themselves.</p>
</sec>
<sec id="s2o"><title>Proteins Involved in the Repair of UV-Induced DNA Damage</title>
<p>Much of the attention in studies of the microbes in the world's oceans has justifiably focused on phototrophy, such as that carried out by the proteorhodopsin proteins. Previously, in the Sargasso Sea study [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
] it was shown that shotgun sequencing reveals a much greater diversity of proteorhodopsin-like proteins than was previously known from cloning and PCR studies. However, along with the potential benefits of phototrophy come many risks, such as the damage caused to cells by exposure to solar irradiation, especially the UV wavelengths. Organisms deal with the potential damage from UV irradiation in several ways, including protection (e.g., UV absorption), tolerance, and repair [<xref ref-type="bibr" rid="pbio-0050016-b078">78</xref>
]. Our examination of the protein family clusters reveals that the GOS data provides an order of magnitude increase in the diversity (in both numbers and types) of homologs of proteins known to be involved in pathways specifically for repairing UV damage.</p>
<p>One aspect of the diversity of UV repair genes is seen in the overrepresentation of photolyase homologs in the GOS data (see <xref ref-type="table" rid="pbio-0050016-t010">Table 10</xref>
). Photolyases are enzymes that chemically reverse the UV-generated inappropriate covalent bonds in cyclobutane pyrimidine dimers and 6–4 photoproducts [<xref ref-type="bibr" rid="pbio-0050016-b079">79</xref>
]. The massive number of homologs of these proteins in the GOS data (11,569 GOS proteins in four clusters; see <xref ref-type="table" rid="pbio-0050016-t010">Table 10</xref>
) is likely a reflection of their presence in diverse species and the existence of novel functions in this family. New repair functions could include repair of other forms of UV dimers (e.g., involving altered bases), use of novel wavelengths of light to provide the energy for repair, repair of RNA, or repair in different sequence contexts. In addition, some of these proteins may be involved in regulating circadian rhythms, as seen for photolyase homologs in various species. Our findings are consistent with the recent results of a comparative metagenomic survey of microbes from different depths that found an overabundance of photolyase-like proteins at the surface [<xref ref-type="bibr" rid="pbio-0050016-b051">51</xref>
].</p>
<p>A good deal was known about the functions and diversity of photolyases prior to this project. However, much less is known about other UV damage–specific repair enzymes, and examination of the GOS data reveals a remarkable diversity of each of these. For example, prior to this project, there were only some 25 homologs of UV dimer endonucleases (UVDEs) available [<xref ref-type="bibr" rid="pbio-0050016-b080">80</xref>
], and most of these were from <italic>Bacillus</italic>
 species. There are 420 homologs of UVDE (cluster 6239) in the GOS data representing many new subfamilies (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
A and <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). A similar pattern is seen for spore lyases (which repair a UV lesion specific to spores [<xref ref-type="bibr" rid="pbio-0050016-b081">81</xref>
]) and the pyrimidine dimer endonuclease (DenV, which was originally identified in T4 phage [<xref ref-type="bibr" rid="pbio-0050016-b082">82</xref>
]). We believe this will also be true for UV dimer glycosylases [<xref ref-type="bibr" rid="pbio-0050016-b083">83</xref>
], but predictions of function for homologs of these genes are difficult since they are in a large superfamily of glycosylases.</p>
<p>Our analysis of the kingdom classification assignments suggests that the diversity of UV-specific repair pathways is seen for all types of organisms in the GOS samples. This apparently extends even to the viral world (e.g., 51 of the UVDE homologs are assigned putatively to viruses), suggesting that UV damage repair may be a critical function that phages provide for themselves and their hosts in ocean surface environments. Based on the sheer numbers of genes, their sequence diversity, and the diversity of types of organisms in which they are apparently found, we conclude that many novel UV damage–repair processes remain to be discovered in organisms from the ocean surface water.</p>
</sec>
<sec id="s2p"><title>Evidence of Reversible Phosphorylation in the Oceans</title>
<p>Reversible phosphorylation of proteins represents a major mechanism for cellular processes, including signal transduction, development, and cell division [<xref ref-type="bibr" rid="pbio-0050016-b084">84</xref>
]. Protein kinases and phosphatases act as antagonistic regulators of the cellular response. Protein phosphatases are divided into three major groups based on substrate specificity [<xref ref-type="bibr" rid="pbio-0050016-b085">85</xref>
]. The Mg<sup>2+</sup>
- or Mn<sup>2+</sup>
-dependent phosphoserine/phosphothreonine protein phosphatase family, exemplified by the human protein phosphatase 2C (PP2C), represents the smallest group in number. An understanding of their physiological roles has only recently begun to emerge. In eukaryotes, one of the major roles of PP2C activity is to reverse stress-induced kinase cascades [<xref ref-type="bibr" rid="pbio-0050016-b086">86</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b089">89</xref>
].</p>
<p>We identified 613 PP2C-like sequences in the GOS dataset, and they are grouped into two clusters (<xref ref-type="table" rid="pbio-0050016-t010">Table 10</xref>
). These sequences contain at least seven motifs known to be important for phosphatase structure and function [<xref ref-type="bibr" rid="pbio-0050016-b090">90</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b091">91</xref>
]. Invariant residues involved in metal binding (aspartate in motifs I, II, VIII) and phosphate ion binding (arginine in motif I) are highly conserved among the GOS sequences.</p>
<p>Using the catalytic domain portion of these sequences we constructed a phylogeny showing that despite the overall conserved structure of the PP2C family of proteins, the known bacterial PP2C-like sequences group together with the GOS bacterial PP2C-like sequences (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
B, <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Furthermore, the eukaryotic PP2Cs display a much greater degree of sequence divergence compared to the bacterial PP2C sequences.</p>
<p>We also examined the combined dataset of PP2C-like phosphatases further for potential differences in amino acid composition between the bacterial and eukaryotic groups. We observed a striking distinction between the eukaryotic and bacterial PP2C-like phosphatases in motif II, where a histidine residue (His62 in human PP2Ca) is conserved in more than 90% of the eukaryotic sequences but is not observed in the bacterial group. The bacterial PP2C group instead contains a methionine at the corresponding position in the majority of cases (70%). This histidine residue is involved in the formation of a beta hairpin in the crystal structure of human PP2C [<xref ref-type="bibr" rid="pbio-0050016-b091">91</xref>
]. Furthermore, His62 is proposed to act as a general acid for PP2C catalysis [<xref ref-type="bibr" rid="pbio-0050016-b092">92</xref>
]. Both amino acids lie in the proximity of the phosphate-binding domain, but at this time it is unclear how the difference at this position would contribute to the overall structure and function of the two PP2C groups. Nonetheless, the large number of diverse PP2C-like phosphatases in this dataset allowed us to identify a previously unrecognized key difference between bacterial and eukaryotic PP2Cs.</p>
<p>Bacterial genes that perform closely related functions can be organized in close proximity to each other and often in functional units. Linked Ser/Thr kinase-phosphatase genetic units have been described in several bacterial species, including <italic>Streptococcus pneumoniae, Bacillus subtilis,</italic>
 and <italic>Mycobacterium tuberculosis</italic>
 [<xref ref-type="bibr" rid="pbio-0050016-b093">93</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b096">96</xref>
]. Two major neighboring clusters are found to be associated with the set of PP2C-like phosphatases in the GOS bacterial group. We observed that one of these clusters contained a protein serine/threonine kinase domain as its most common Pfam domain. An additional neighboring cluster found to be associated with the GOS set of bacterial PP2Cs was identified as a set of sequences containing a PASTA (penicillin-binding protein and serine/threonine kinase–associated) domain. This domain is unique to bacterial species, and is believed to play important roles in regulating cell wall biosynthesis [<xref ref-type="bibr" rid="pbio-0050016-b097">97</xref>
].</p>
<p>Our identification of a conserved group of unique PP2C-like phosphatases in the GOS dataset significantly expands the known size and diversity of this enzyme family. This analysis of the NCBI-nr, PG ORFs, TGI-EST ORFs, and ENS datasets, along with the sequences obtained from the GOS dataset, raises the overall number of PP2C-like sequences well above that estimated just a year ago [<xref ref-type="bibr" rid="pbio-0050016-b098">98</xref>
]. The presence of genes encoding bacterial serine/threonine kinase domains located adjacent to PP2Cs in the GOS data supports the notion that the process of reversible phosphorylation on Ser/Thr residues controls important physiological processes in bacteria.</p>
</sec>
<sec id="s2q"><title>Proteases in GOS Data</title>
<p>Proteases are a group of enzymes that degrade other proteins and, as such, play important roles in all organisms [<xref ref-type="bibr" rid="pbio-0050016-b099">99</xref>
]. On the basis of their catalytic mechanism, proteases are divided into six distinct catalytic types: aspartic, cysteine, metallo, serine, threonine, and glutamic proteases [<xref ref-type="bibr" rid="pbio-0050016-b099">99</xref>
]. They differ from each other by the presence of specific amino acids in the active site and by their mode of action. The MEROPS database [<xref ref-type="bibr" rid="pbio-0050016-b100">100</xref>
] is a comprehensive source of information for this large divergent group of sequences and provides a widely accepted classification of proteases into families, based on amino acid sequence comparison, and then into clans, based on the similarity of their 3-D structures.</p>
<p>We identified 222,738 potential proteases in the GOS dataset based on similarity to sequences in MEROPS (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). According to our clustering method, 95% of these sequences are grouped into 190 clusters, with each cluster containing on average more than 1,100 GOS sequences. These sequences were compared to proteases in NCBI-nr. Some groups of proteases in NCBI-nr are highly redundant; for example, a large number of viral proteases from HIV-1 and hepatitis C viruses dominate the NCBI-nr protease set. Thus, we computed a nonredundant set of NCBI-nr proteases and, for the sake of consistency, a nonredundant set of proteases from the GOS set using the same parameters. Both sets are dominated by cysteine, metallo-, and serine proteases. Most proteases in the GOS dataset belong to the bacterial kingdom, which is not surprising given the filter sizes used to collect the samples. In NCBI-nr the proteases are more evenly distributed between the bacterial and the eukaryotic kingdoms.</p>
<p>Our comparison of the protease clan distribution of the bacterial sequences in the NCBI-nr and GOS sets reveals that the distribution of clans is very similar for metallo- and serine proteases. However, the distribution of clans in aspartic and cysteine proteases is different in the two datasets. Among aspartic proteases, the most visible difference is the increased ratio of proteases of the AC clan and the decreased ratio in the AD clan. Proteases in the former clan are involved in bacterial cell wall production, while those in the latter clan are involved in pilin maturation and toxin secretion [<xref ref-type="bibr" rid="pbio-0050016-b099">99</xref>
]. Among cysteine proteases, the most apparent difference is a decrease in the CA clan and an increase in the number of proteases from the PB(C) clan. Bacterial members of the CA clan are mostly involved in degradation of bacterial cell wall components and in various aspects of biofilm formation [<xref ref-type="bibr" rid="pbio-0050016-b099">99</xref>
]. It is possible that both activities are less important for marine bacteria present in surface water. Proteases from the PB(C) clan are involved in activation (including self-activation) of enzymes from the acetyltransferase family. In fungi this family is involved in penicillin synthesis, while its function in bacteria is unknown [<xref ref-type="bibr" rid="pbio-0050016-b099">99</xref>
].</p>
<p>We were unable to detect any caspases (members of the CD clan) in the GOS data. This is consistent with the apoptotic cell death mechanism being present only in multicellular eukaryotes, which, based on the filter sizes, are expected to be very rare in the GOS dataset.</p>
</sec>
<sec id="s2r"><title>Metabolic Enzymes in the GOS Data</title>
<p>To gain insights into the diversity of metabolism of the organisms in the sea, we studied the abundance and diversity of glutamine synthetase (GS) and ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO), two key enzymes in nitrogen and carbon metabolism.</p>
<p>GS is the central player of nitrogen metabolism in all organisms on earth. It is one of the oldest enzymes in evolution [<xref ref-type="bibr" rid="pbio-0050016-b101">101</xref>
]. It converts ammonia and glutamate into glutamine that can be utilized by cells. GS can be classified into three types based on sequence [<xref ref-type="bibr" rid="pbio-0050016-b101">101</xref>
]. Type I has been found only in bacteria, and it forms a dodecameric structure [<xref ref-type="bibr" rid="pbio-0050016-b102">102</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b103">103</xref>
]. Type II has been found mainly in eukaryotes, and in some bacteria. Type III GS is less well studied, but has been found in some anaerobic bacteria and cyanobacteria. There are 18 active site residues in both bacterial and eukaryotic GS that play important roles in binding substrates and catalyzing the enzymatic reactions [<xref ref-type="bibr" rid="pbio-0050016-b104">104</xref>
].</p>
<p>We found 9,120 GS and GS-like sequences in the GOS data (<xref ref-type="table" rid="pbio-0050016-t010">Table 10</xref>
). Using profile HMMs [<xref ref-type="bibr" rid="pbio-0050016-b041">41</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b105">105</xref>
] constructed from known GS sequences of different types, we were able to classify 4,350 sequences as type I GS, 1,021 sequences as type II GS, and 469 sequences as type III GS (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
).</p>
<p>The number of type II GS sequences found in the GOS data is surprisingly high, since previously type II GS were considered to be mainly eukaryotic and very few eukaryotic organisms were expected to be included in the GOS sequencing (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
C and <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). We used gene neighbor analysis to classify the origin of GS genes by the nature of other proteins found on the same scaffold. Using this approach, most of the neighboring genes of the type II GS in the GOS data are identified as bacterial genes. The neighboring genes of the type II GS include nitrogen regulatory protein PII, signal transduction histidine kinase, NH<sub>3</sub>
-dependent NAD<sup>+</sup>
 synthetase, A/G-specific adenine glycosylase, coenzyme PQQ synthesis protein c, pyridoxine biosynthesis enzyme, aerobic-type carbon monoxide dehydrogenase, etc. We were able to assign more than 90% of the type II GS sequences in the GOS data to bacterial scaffolds based on a BLAST-based kingdom assignment method (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Both neighboring genes and kingdom assignments suggest that most of the type II GS sequences in the GOS data come from bacterial organisms. In comparison, the same type II GS profile HMM detects only 12 putative type II GS sequences from the PG dataset of 222 prokaryotic genomes. Within these, there are only seven unique type II GS sequences and six unique bacterial species represented. The reason why bacteria in the ocean have so many type II GS genes is unclear.</p>
<p>Two hypotheses have been raised to explain the origin of type II GS in bacterial genomes: lateral gene transfer from eukaryotic organisms [<xref ref-type="bibr" rid="pbio-0050016-b106">106</xref>
] and gene duplication prior to the divergence of prokaryotes and eukaryotes [<xref ref-type="bibr" rid="pbio-0050016-b101">101</xref>
]. The type II GS sequences in the predominantly bacterial GOS data are not only abundant, but also diverse and divergent from most known eukaryotic GS sequences (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
C). This makes the hypothesis of lateral gene transfer less favorable. If the GS gene duplication preceded the prokaryote–eukaryote divergence according to the gene duplication hypothesis, it is possible that many oceanic organisms retained type II GS genes during evolution.</p>
<p>Interestingly, we found 19 cases where a type I GS gene is adjacent to a type II GS gene on the same scaffold. Both GS genes seem to be functional based on the high degree of conservation of active site residues. The same gene arrangement was observed previously in <named-content content-type="genus-species">Frankia alni</named-content>
 CpI1 [<xref ref-type="bibr" rid="pbio-0050016-b107">107</xref>
]. The functional significance of maintaining two types of GS genes adjacent to one another in the genome remains to be elucidated. Most of the sequences of these GS genes are highly similar. We examined the geographic distribution of these adjacent GS sequences across all the GOS samples. They are mainly found in the samples taken from two sites. Their geographic distribution is significantly different from the distributions of types I and II GS across the samples. The high sequence similarity among the adjacent GS pairs and their geographic distribution suggest that these adjacent GS sequences may come from only a few closely related organisms. This is consistent with the protein sequence tree of type II GS, where the type II GS sequences from the GS gene pairs mainly reside in two distinct branches (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
C).</p>
<p>The active site residues are very well conserved in all GS sequences in the GOS data, except one residue, Y179, which coordinates the ammonium-binding pocket. We observed substitutions of Y179 to phenylalanine in about half of the type II GS sequences. The activity of type I GS in some bacteria is regulated by adenylylation at residue Tyr397. In the GOS data, Tyr397 is relatively conserved in type I GS, with variations to phenylalanine and tryptophan in about half of the sequences. This indicates that the activity of some of the type I GS is not regulated by adenylylation, as shown previously in some Gram-positive bacteria [<xref ref-type="bibr" rid="pbio-0050016-b108">108</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b109">109</xref>
].</p>
<p>RuBisCO is the key enzyme in carbon fixation. It is the most abundant enzyme on earth [<xref ref-type="bibr" rid="pbio-0050016-b110">110</xref>
] and plays an important role in carbon metabolism and the CO<sub>2</sub>
 cycle. RuBisCO can be classified into four forms. Form I has been found in both plants and bacteria, and has an octameric structure. Form II has been found in many bacteria, and it forms a dimer in <italic>Rhodospirillum rubrum.</italic>
 Form III is mainly found in archaea, and forms various oligomers. Form IV, also called the RuBisCO-like protein (RLP), has been recently discovered from bacterial genome-sequencing projects [<xref ref-type="bibr" rid="pbio-0050016-b111">111</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b112">112</xref>
]. RLP represents a group of proteins that do not have RuBisCO activity, but resemble RuBisCO in both sequence and structure [<xref ref-type="bibr" rid="pbio-0050016-b111">111</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b113">113</xref>
]. The functions of RLPs are largely unknown and seem to differ from each other.</p>
<p>In contrast to the large number of GS sequences, we identified only 428 sequences homologous to the RuBisCO large subunit in the GOS data. The small number of RuBisCO sequences may partly be due to the exclusion of larger bacterial organisms from sequencing by size filtering. However, it could also indicate that CO<sub>2</sub>
 is not the major carbon source for these sequenced ocean organisms.</p>
<p>The RuBisCO homologs in the GOS data are more diverse than the currently known RuBisCOs (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
D, <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Six of 19 active site residues—N123, K177, D198, F199, H327, and G404—are not well conserved in all sequences, suggesting that the proteins with these mutations may have evolved to have new functions, such as in the case of RLPs. From the studies of the RLPs from <named-content content-type="genus-species">Chlorobium tepidum</named-content>
 and <named-content content-type="genus-species">B. subtilis</named-content>
 [<xref ref-type="bibr" rid="pbio-0050016-b111">111</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b114">114</xref>
], it has been shown that the active site of RuBisCO can accommodate different substrates and is potentially capable of evolving new catalytic functions [<xref ref-type="bibr" rid="pbio-0050016-b113">113</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b114">114</xref>
]. On the other hand, two sequence motifs, helices αB and α8, that are not involved in substrate binding and catalytic activity are well conserved in the GOS RuBisCO sequences. That these non–active site residues are more conserved than the active site residues suggests that these motifs are important for structure, function, or interaction with other proteins.</p>
<p>We found 47 (31 at 90% identity filtering) GOS sequences in the branch with known RLP sequences in a phylogenetic tree of RuBisCO (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
D). In this phylogenetic tree, in addition to the clades for each of the four forms of RuBisCO, there are also new groups of 65 (58 at 90% identity filtering) GOS sequences that do not cluster with any known RuBisCO sequences. This indicates that there could be more than one type of RuBisCO-like protein existing in organisms. The novel groups of RuBisCO homologs in the GOS data also suggest that we have not fully explored the entire RuBisCO family of proteins (<xref ref-type="fig" rid="pbio-0050016-g007">Figure 7</xref>
D).</p>
</sec>
<sec id="s2s"><title>GOS Data and Remote Homology Detection</title>
<p>The addition of GOS sequences may help greatly in defining the range and diversity of many known protein families, both by addition of many new sequences and by the increased diversity of GOS sequences. Our comparison of HMM scores for GOS sequences with those from the other four datasets shows that GOS sequences consistently tend to have lower scores, which indicates additional diversity from that captured in the original HMM (<xref ref-type="fig" rid="pbio-0050016-g008">Figure 8</xref>
). The addition of GOS data into domain profiles may broaden the profile and allow it to detect additional remote family members in both GOS and other datasets. As a trial, we rebuilt the Pfam model PF01396, which describes a zinc finger domain within bacterial DNA topoisomerase. The original model finds 821 matches to 481 proteins in NCBI-nr. Our model that includes GOS sequences reveals 1,497 matches to 722 sequences, an increase of 50% in sequences and 82% in domains (most topoisomerases have three such domains, of which one is divergent and difficult to detect). Of these new matches, 104 are validated by the presence of additional topoisomerase domains, or they are annotated as topoisomerase, while most others are unannotated or similar to other DNA-modifying enzymes not previously thought to have zinc finger domains.</p>
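<p>The reported gains for the rebuilt PF01396 model follow directly from the match counts quoted above. The short check below uses only those four counts; the helper name is ours, and the matches-per-protein ratio is an additional derived quantity rather than a figure reported in the study.</p>
<preformat>
```python
def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return 100.0 * (after - before) / before


# Counts quoted in the text for Pfam model PF01396 against NCBI-nr.
proteins_before, proteins_after = 481, 722    # proteins with a match
domains_before, domains_after = 821, 1497     # individual domain matches

protein_gain = round(percent_increase(proteins_before, proteins_after))
domain_gain = round(percent_increase(domains_before, domains_after))
print(protein_gain)  # 50
print(domain_gain)   # 82

# Derived quantity: average number of domain matches per matched protein.
per_protein_before = round(domains_before / proteins_before, 2)
per_protein_after = round(domains_after / proteins_after, 2)
print(per_protein_before)  # 1.71
print(per_protein_after)   # 2.07
```
</preformat>
<p>The rise in domain matches per protein is in line with the remark that most topoisomerases carry three such domains, one of which is divergent and difficult to detect.</p>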
<p>HMM profiles can be further exploited by using matches beyond the conservative trusted cutoff (TC) used in this study. For instance, the Pfam for the poxvirus A22 protein family has no GOS matches above the TC, but 137 matches with <italic>E-</italic>
values of 1 × 10<sup>−3</sup>
 to 1 × 10<sup>−10</sup>
, which share a short conserved motif with A22 proteins. Alignment of these matches shows an additional two short motifs in common with A22, establishing their homology. Using a profile HMM, we found a total of 269 family members in GOS and eight family members in NCBI-nr. Many members of this new family are surrounded by other novel clusters, or are in putative viral scaffolds, suggesting that these weak matches are an entry point into a new clade of viruses.</p>
</sec>
<sec id="s2t"><title>ORFans with Matches in GOS Data</title>
<p>Further evidence of the diversity added by GOS sequences is provided by their matches to ORFans. ORFans are sequences in current protein databases that do not have any recognizable homologs [<xref ref-type="bibr" rid="pbio-0050016-b117">117</xref>
]. ORFan sequences (discounting those that may be spurious gene predictions) represent genes with organism-specific functions or very remote homologs of known families. They have the potential to shed light on how new proteins emerge and how old ones diversify.</p>
<p>We identified 84,911 ORFans (5,538 archaeal, 35,292 bacterial, 37,427 eukaryotic, 5,314 viral, and 1,340 unclassified) from the NCBI-nr dataset using CD-HIT [<xref ref-type="bibr" rid="pbio-0050016-b116">116</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b117">117</xref>
] and BLAST (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). Of these, 6,044 have matches to GOS sequences using BLAST (<italic>E-</italic>
value ≤1 × 10<sup>−6</sup>
). <xref ref-type="fig" rid="pbio-0050016-g009">Figure 9</xref>
 shows the distribution of the matched ORFans grouped by organisms, number of their GOS matches, and the lowest <italic>E-</italic>
value of the matches. We found matches to GOS sequences for 13%, 6.3%, 0.89%, and 8.9% of bacterial, archaeal, eukaryotic, and viral ORFans, respectively. While most of these ORFans have very few GOS matches, 626 of them have ≥20 GOS matches. The similarities between GOS sequences and eukaryotic ORFans are much weaker than those between GOS sequences and noneukaryotic ORFans. The average sequence identity between eukaryotic ORFans and their closest GOS matches is 38%. This is 6% lower than the identity between noneukaryotic ORFans and their closest GOS matches.</p>
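<p>As a consistency check on the ORFan counts quoted above, the sketch below sums the per-kingdom totals and derives two overall fractions. Only the counts come from the text; the two derived percentages are our own arithmetic and are not figures reported in the study.</p>
<preformat>
```python
# ORFan counts per kingdom, as quoted in the text.
orfans = {
    "archaea": 5538,
    "bacteria": 35292,
    "eukaryota": 37427,
    "viruses": 5314,
    "unclassified": 1340,
}
matched = 6044       # ORFans with at least one GOS match
many_matches = 626   # ORFans with 20 or more GOS matches

total = sum(orfans.values())
print(total)  # 84911, the stated number of ORFans

# Derived (not reported in the text): share of all ORFans with a GOS match.
overall_matched_pct = round(100.0 * matched / total, 1)
print(overall_matched_pct)  # 7.1

# Derived: share of matched ORFans that have 20 or more GOS matches.
many_pct = round(100.0 * many_matches / matched, 1)
print(many_pct)  # 10.4
```
</preformat>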
<p>The ORFans that match GOS sequences are from approximately 600 organisms. <xref ref-type="table" rid="pbio-0050016-t011">Table 11</xref>
 lists the 20 organisms with the most matched ORFans. Out of the 6,044 matched ORFans, approximately 2,000 are from these 20 organisms. For example, <named-content content-type="genus-species">Rhodopirellula baltica</named-content>
 SH 1, a marine bacterium, has 7,325 proteins deposited in NCBI-nr. We identified 1,418 ORFans in this organism, of which 322 have GOS matches. Another interesting example in this list is <italic>Escherichia coli.</italic>
 Although more than 20 different strains have been sequenced, 168 ORFans are identified in strain CFT073, and 67 of them have GOS matches. The only eukaryotic organism in this list is <named-content content-type="genus-species">Candida albicans</named-content>
 SC5314, a fungal human pathogen, which has 49 ORFans with GOS matches.</p>
<p>We examined a small but interesting subset of the ORFans that have 3-D structures deposited in PDB. Of the 65 PDB ORFans, eight have GOS matches (see Supporting Information for their PDB identifiers and names). They include four restriction endonucleases, three hypothetical proteins, and a glucosyltransferase.</p>
<p>GOS sequences can play an important role in identifying the functions of existing ORFans or in confirming protein predictions. For example, we found that the hypothetical protein AF1548, which is a PDB ORFan, has matches to 16 GOS sequences. A PSI-BLAST search with AF1548 as the query against a combined set of GOS and NCBI-nr sequences returned several restriction endonucleases as significant hits after three iterations. With the support of the 3-D structure and a multiple sequence alignment of AF1548 and its GOS matches, we predict that AF1548 and its GOS homologs are restriction endonucleases (<xref ref-type="fig" rid="pbio-0050016-g010">Figure 10</xref>
). By combining this evidence with an established consensus of active sites of the related endonuclease families [<xref ref-type="bibr" rid="pbio-0050016-b118">118</xref>
], we predicted three catalytic residues.</p>
</sec>
<sec id="s2u"><title>Genome Sequencing Projects and Protein Exploration</title>
<p>With respect to protein exploration and novel family discovery, microbial sequencing offers more promise than sequencing additional mammalian genomes. This is illustrated by <xref ref-type="fig" rid="pbio-0050016-g011">Figure 11</xref>
<bold>,</bold>
 where the number of clusters that protein predictions from various finished mammalian genomes fall into was compared to the number of clusters that similar-sized random subsets of microbial sequences fall into (see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
). As the figure shows, the rate of protein family discovery is higher for microbes than for mammals. Indeed, the rate of new family discovery is plateauing for mammalian sequences. This is not surprising, as mammalian divergence from a common ancestor is much more recent than microbial divergence from a common ancestor, which suggests that mammals will share a larger core set of less-diverged proteins. Microbial sequencing is also more cost effective than mammalian sequencing for acquiring protein sequences because microbial protein density is typically 80%–90% versus 1%–2% for mammals. This could be addressed with mammalian mRNA sequencing, but issues with acquiring rarely expressed mRNAs would need to be considered. There are, of course, other reasons to sequence mammalian genomes, such as understanding mammalian evolution and mammalian gene regulation.</p>
</sec>
<sec id="s2v"><title>Conclusions</title>
<p>The rate of protein family discovery is approximately linear in the (current) number of protein sequences. Additional sequencing, especially of microbial environments, is expected to reveal many more protein families and subfamilies. The potential for discovering new protein families is also supported by the GOS diversity seen at the nucleotide level across the different sampling sites [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
]. Averaged over the sites, 14% of the GOS sequence reads from a site are unique (at 70% nucleotide identity) to that site [<xref ref-type="bibr" rid="pbio-0050016-b030">30</xref>
].</p>
<p>The GOS data provide almost complete coverage of known prokaryotic protein families. In addition, they add a great deal of diversity to many known families and offer new insights into the evolution of these families. This is illustrated using several protein families, including UV damage–repair enzymes, phosphatases, proteases, glutamine synthetase, RuBisCO, RecA (unpublished data), and kinases [<xref ref-type="bibr" rid="pbio-0050016-b077">77</xref>
]. Only a handful of protein families have been examined thus far, and many thousands more remain to be explored.</p>
<p>The protein analysis presented indicates that we are far from exploring the diversity of viruses. This is reflected in several of the analyses. The GOS-only clusters show an overrepresentation of sequences of viral origin. In addition, our domain analysis using HMM profiling shows a lower Pfam coverage of the GOS sequences in the viral kingdom compared to the other kingdoms. At least two of the protein families we explored in detail (UV repair enzymes and glutamine synthetase) contain abundant new viral additions. The extraordinary diversity of viruses in a variety of environmental settings is only now beginning to be understood [<xref ref-type="bibr" rid="pbio-0050016-b057">57</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b119">119</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b121">121</xref>
]. A separate analysis of GOS microbial and viral sequences (unpublished data) shows that multiple viral protein clusters contain significant numbers of host-derived proteins, suggesting that viral acquisition of host genes is quite widespread in the oceans.</p>
<p>Data generated by this GOS study and similar environmental shotgun sequencing studies present their own analysis challenges. Methods for various analyses (e.g., sequence alignment, profile construction, phylogeny inference, etc.) are generally designed and optimized to work with full sequences. They have to be tailored to analyze the mostly fragmentary sequences that are generated by these projects. Nevertheless, these data are a valuable source of new discoveries. These data have the potential to refine old hypotheses and make new observations about proteins and their evolution. Our preliminary exploration of the GOS data identified novel protein families and also showed that many ORFan sequences from current databases have homologs in these data. The diversity added by GOS data to protein families also allows for the building of better profile models and thereby improves remote homology detection. The discovery of kingdom-crossing protein families that were previously thought to be kingdom-specific presents evidence that the GOS project has excavated proteins of more ancient lineage than that previously known, or that have undergone lateral gene transfer. This is another example of how metagenomics studies are changing our understanding of protein sequences, their evolution, and their distribution across the various forms of life and environments. Biases in the currently published databases due to oversampling of some proteins or organisms are illuminated by environmental surveys that lack such biases. Such knowledge can help us make better predictions of the real distribution patterns of proteins in the natural world and indicate where increased sampling would be likely to uncover new families or family members of tremendous diversity (such as in the viral kingdom).</p>
<p>These data have other significant implications for the fields of protein evolution and protein structure prediction. Having several hundreds or even tens of thousands of diverse proteins from a family or examples of a specific protein fold should provide new approaches for developing protein structure prediction models. Development of algorithms that consider the alignments of all these family members/protein folds and analyze how amino acid sequence can vary without significantly altering the tertiary structure or function may provide insights that can be used to develop new ab initio methods for predicting protein structures. These same datasets could also be used to begin to understand how a protein evolves a new function. Finally, this large database of amino acid sequence data could help to better understand and predict the molecular interactions between proteins. For example, they may be used to predict the protein–protein interactions so critical for the formation of specific functional complexes within cells.</p>
<p>The GOS data also have implications for nearly all computational methods relying on sequence data. The increase in the number of known protein sequences presents challenges to many algorithms due to the increased volume of sequences. In most cases this increase in sequence data can be compensated for with additional CPU cycles, but it is also a foreshadowing of times to come as the pace of large-scale sequence-collecting accelerates. A related challenge is the increase in the diversity of protein families, with many new divergent clades present. With more protein similarity relationships falling into the twilight zone overlapping with random sequence similarity, the number of false positives for homology detection methods increases, making the true relationships more difficult to identify. Nevertheless, a deeper knowledge of protein sequence and family diversity introduces unprecedented opportunities to mine similarity relationships for clues on molecular function and molecular interactions as well as providing much expanded data for all methods that use homologous sequence information.</p>
<p>The GOS dataset has demonstrated the usefulness of large-scale environmental shotgun sequencing projects in exploring proteins. These projects offer an unbiased view of proteins and protein families in an environmental sample. However, it should be noted that the GOS data reported here are limited to mostly ocean surface microbes. Even with this targeted sampling a tremendous amount of diversity is added to known families, and there is evidence for a large number of novel families. Additional data from larger filter sizes (that will sample more eukaryotes) coupled with metagenomic studies of different environments like soil, air, deep sea, etc. will help to achieve the ultimate goal of a whole-earth catalog for proteins.</p>
</sec>
</sec>
<sec sec-type="materials|methods" id="s3"><title>Materials and Methods</title>
<sec id="s3a"><title>Data description.</title>
<p>NCBI-nr [<xref ref-type="bibr" rid="pbio-0050016-b031">31</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b032">32</xref>
] is the single largest publicly available protein resource and includes protein sequences submitted to SWISS-PROT (curated protein database) [<xref ref-type="bibr" rid="pbio-0050016-b122">122</xref>
], PDB (a database of amino acid sequences with solved structures) [<xref ref-type="bibr" rid="pbio-0050016-b123">123</xref>
], PIR (Protein Information Resource) [<xref ref-type="bibr" rid="pbio-0050016-b124">124</xref>
], and PRF (Protein Research Foundation). In addition, NCBI-nr also contains protein predictions from DNA sequences from both finished and unfinished genomes in GenBank [<xref ref-type="bibr" rid="pbio-0050016-b125">125</xref>
], EMBL [<xref ref-type="bibr" rid="pbio-0050016-b126">126</xref>
], and DNA Databank of Japan (DDBJ) [<xref ref-type="bibr" rid="pbio-0050016-b127">127</xref>
]. The nonredundancy in NCBI-nr is only to the level of distinct sequences, and any two sequences of the same length and content are merged into a single entry. NCBI-nr contains partial protein sequences and is not a fully curated database. Therefore it also contains contaminants in the form of sequences that are falsely predicted to be proteins.</p>
<p>Expressed sequence tag (EST) databases also provide the potential to add a great deal of information to protein exploration and contain information that is not well represented in NCBI-nr. To this end, assemblies of EST sequences from the TIGR Gene Indices [<xref ref-type="bibr" rid="pbio-0050016-b034">34</xref>
], an EST database, were included in this study. To minimize redundancy, only EST assemblies from those organisms for which the full genome is not yet known were included. The protein predictions on metazoan genomes that are fully sequenced and annotated were obtained by including the Ensembl database [<xref ref-type="bibr" rid="pbio-0050016-b035">35</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b036">36</xref>
] in this study.</p>
<p>Both finished and unfinished sequences from prokaryotic genome projects submitted to NCBI were included. The protein predictions from the individual sequencing projects are submitted to NCBI-nr. Nevertheless, these genomes were included in this dataset both for the purpose of evaluating our approach and also for the purpose of identifying any proteins that were missed by the annotation process used in these projects.</p>
<p>Thus, for this study the following publicly available datasets, all downloaded on February 10, 2005—NCBI-nr, PG, TGI-EST, and ENS—were used. The organisms in the PG set and the TGI-EST set are listed in <xref ref-type="supplementary-material" rid="pbio-0050016-sd001">Protocol S1</xref>
.</p>
</sec>
<sec id="s3b"><title>Assembly of the GOS dataset.</title>
<p>Initial assembly (construction of “unitigs”) was performed so that only overlaps of at least 98% DNA sequence identity and no conflicts with other overlaps were accepted. False assemblies at this phase of the assembler are extremely rare, even in the presence of complex datasets [<xref ref-type="bibr" rid="pbio-0050016-b037">37</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b128">128</xref>
]. Paired-end (also known as mate-pair) data were then used to order, orient, and merge unitigs into the final assemblies, but only when two mate pairs or a single mate pair and an overlap between unitigs implied the same layout. In one respect, mate pair data was used more aggressively than is typical in assembly of a single genome in that depth-of-coverage information was largely ignored [<xref ref-type="bibr" rid="pbio-0050016-b010">10</xref>
]. This potentially allows chimeric assemblies through a repeat within a genome or through an ortholog between genomes. Thus, a conclusion that relies on the correctness of a single assembly involving multiple unitigs should be considered tentative until the assembly can be confirmed in some way. Assemblies involved in key results in this paper were subjected to expert manual review based on thickness of overlaps, presence of well-placed mate pairs across thin overlaps or across gaps between contigs, and consistency of depth of coverage.</p>
</sec>
<sec id="s3c"><title>Data release and availability.</title>
<p>All the GOS protein predictions will be submitted to GenBank. In addition, all the data supporting this paper, including the clustering and the various analyses, will be made publicly available via the CAMERA project (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis; <ext-link ext-link-type="uri" xlink:href="http://camera.calit2.net">http://camera.calit2.net</ext-link>
), which is funded by the Gordon and Betty Moore Foundation.</p>
</sec>
<sec id="s3d"><title>All-against-all BLASTP search.</title>
<p>We used two sets of computer resources. At the J. Craig Venter Institute, 125 dual 3.06-GHz Xeon processor systems with 2 GB of memory per system were used. Each system had 80 GB of local storage and was connected by GBit ethernet with storage area network (SAN) I/O of ~24 GBit/sec and network attached storage (NAS) I/O of ~16 GBit/sec. A total of 466,366 CPU hours was used on this system. In addition, access to the National Energy Research Scientific Computing Center (NERSC) Seaborg computer cluster was available, including 380 nodes each with sixteen 375-MHz Power3 processors. The systems had between 16 GB and 64 GB of memory. Only 128 nodes were used at a time. A total of 588,298 CPU hours was used on this system. The dataset of 28.6 million sequences was searched against itself in a half-matrix using NCBI BLAST [<xref ref-type="bibr" rid="pbio-0050016-b038">38</xref>
] with the following parameters: -F “m L” -U T -p blastp -e 1 × 10<sup>−10</sup>
 -z 3 × 10<sup>9</sup>
 -b 8000 -v 10. In this paper, <italic>similarity</italic>
 of an alignment is defined to be the fraction of aligned residues with a positive score according to the BLOSUM62 substitution matrix [<xref ref-type="bibr" rid="pbio-0050016-b129">129</xref>
] used in the BLAST searches.</p>
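<p>The similarity measure defined above can be illustrated with a minimal Python sketch. It is not the code used in this study. The names alignment_similarity and toy_score are hypothetical, gap columns are assumed to be excluded from the count, and the small score table holds only a handful of BLOSUM62 entries; unlisted identical pairs are treated as positive and all other unlisted pairs as negative, purely as placeholders.</p>
<preformat>
```python
# Illustrative subset of BLOSUM62 scores (not the full matrix).
TOY_SCORES = {("A", "A"): 4, ("I", "L"): 2, ("K", "R"): 2,
              ("A", "G"): 0, ("W", "W"): 11}


def toy_score(a, b):
    """Symmetric lookup; placeholder values for pairs not in the table."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    if (b, a) in TOY_SCORES:
        return TOY_SCORES[(b, a)]
    return 1 if a == b else -1  # assumption, for illustration only


def alignment_similarity(aligned_a, aligned_b, score=toy_score):
    """Fraction of aligned residue pairs with a positive substitution score.

    aligned_a and aligned_b are equal-length strings with '-' for gaps.
    Columns containing a gap are skipped (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    positives = sum(1 for a, b in pairs if score(a, b) > 0)
    return positives / len(pairs)
```
</preformat>
<p>In practice the score argument would be a lookup into the full BLOSUM62 matrix rather than the toy table shown here.</p>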
</sec>
<sec id="s3e"><title>Identification of nonredundant sequences.</title>
<p>Given a set of sequences <italic>S</italic>
 and a threshold <italic>T</italic>
, a nonredundant subset <italic>S′</italic>
 of <italic>S</italic>
 was identified by first partitioning <italic>S</italic>
 (using the threshold <italic>T</italic>
) and then picking a representative from each partition. The set of representatives constitutes the nonredundant set <italic>S′</italic>
. The process was implemented using the following graph-theoretic approach. A directed graph <italic>G</italic>
 = (<italic>V, E</italic>
) is constructed with vertex set <italic>V</italic>
 and edge set <italic>E</italic>
. Each vertex in <italic>V</italic>
 represents a sequence from <italic>S</italic>
. A directed edge (<italic>u</italic>
,<italic>v</italic>
) ∈ <italic>E</italic>
 if sequence <italic>u</italic>
 is longer than sequence <italic>v</italic>
 and their sequence comparison satisfies the threshold <italic>T;</italic>
 for sequences of identical length, the sequence with the lexicographically larger id is considered the longer of the two. Note that <italic>G</italic>
 does not have any cycles. Source vertices (i.e., vertices with no in-degree) are sorted in decreasing order of their out-degrees and (from largest out-degree to smallest) processed in this order. A source vertex <italic>u</italic>
 is processed as follows: mark all vertices that have not been seen before and are reachable from vertex <italic>u</italic>
 as being redundant and mark vertex <italic>u</italic>
 as their representative.</p>
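<p>The graph-theoretic procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this study: the function name and input layout are hypothetical, the threshold test is assumed to have been applied already (only passing pairs are supplied), and source vertices with equal out-degree are ordered by id simply to make the result deterministic.</p>
<preformat>
```python
from collections import defaultdict


def nonredundant_representatives(lengths, matches):
    """Map every sequence id to the id of its representative.

    lengths -- dict: sequence id -> sequence length
    matches -- iterable of (id_a, id_b) pairs that satisfy threshold T
    """
    def rank(seq_id):
        # Longer sequence wins; equal lengths are resolved in favour of
        # the lexicographically larger id, as described in the text.
        return (lengths[seq_id], seq_id)

    out_edges = defaultdict(set)
    has_incoming = set()
    for a, b in matches:
        if a == b:
            continue
        u, v = (a, b) if rank(a) > rank(b) else (b, a)
        out_edges[u].add(v)      # edge directed from longer to shorter
        has_incoming.add(v)

    # Source vertices, largest out-degree first (id tie-break is an assumption).
    sources = [s for s in lengths if s not in has_incoming]
    sources.sort(key=lambda s: (-len(out_edges[s]), s))

    representative = {}
    for src in sources:
        representative[src] = src
        stack = list(out_edges[src])
        while stack:
            node = stack.pop()
            if node in representative:
                continue         # already claimed by an earlier source
            representative[node] = src
            stack.extend(out_edges[node])
    return representative
```
</preformat>
<p>The nonredundant set is then the set of distinct values in the returned mapping.</p>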
<p>We used two thresholds in this paper, 98% similarity and 100% identity. The former was used in the first stage of the clustering and the latter was used in the HMM profile analysis. For the 98% similarity threshold, two sequences satisfy the threshold if the following three criteria are met: (1) similarity of the match is at least 98%; (2) at least 95% of the shorter sequence is covered by the match; and (3) (match score)/(self score of shorter sequence) ≥ 95%.</p>
<p>For the 100% identity threshold, two sequences satisfy the threshold if their match identity is 100%.</p>
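<p>The two thresholds can be expressed as simple predicates. The sketch below is illustrative only; the Match record and its field names are hypothetical, and all quantities are assumed to be expressed as fractions between 0 and 1.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass
class Match:
    similarity: float          # fraction of aligned residues with positive score
    identity: float            # fraction of identical aligned residues
    shorter_coverage: float    # fraction of the shorter sequence in the match
    score: float               # match score
    shorter_self_score: float  # self score of the shorter sequence


def satisfies_98_similarity(m):
    """All three criteria of the 98% similarity threshold."""
    return (m.similarity >= 0.98
            and m.shorter_coverage >= 0.95
            and m.score / m.shorter_self_score >= 0.95)


def satisfies_100_identity(m):
    """The 100% identity threshold."""
    return m.identity == 1.0
```
</preformat>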
</sec>
<sec id="s3f"><title>Description of the clustering algorithm.</title>
<p>The starting point for the clustering was the set of pairwise sequence similarities identified using the all-against-all BLASTP compute. Because of both the volume and nature of the data, the clustering was carried out in four steps: redundancy removal, core set identification, core set merging, and final recruitment.</p>
<p>A set of nonredundant sequences (at 98% similarity) was identified using the procedure given in <xref ref-type="sec" rid="s3">Materials and Methods</xref>
 (Identification of nonredundant sequences). Only the nonredundant sequences were considered in further steps of the clustering process.</p>
<p>The aim of the core set identification step was to identify <italic>core sets</italic>
 of highly related sequences. In graph-theoretic terms, this involves looking for dense subgraphs in a graph where the vertices correspond to sequences and an edge exists between two sequences if their sequence match satisfies some reasonable threshold (for instance, 40% similarity match over 80% of at least one sequence and are clearly homologous based on the BLAST threshold). Dense subgraphs were identified by using a heuristic. This approach utilizes <italic>long</italic>
 edges. These are edges where the match threshold is computed relative to the longer sequence. This was done to prevent, as much as possible, unrelated proteins from being put into the same core set. If all the sequences were full length, using long edges would have offered a good solution to keeping unrelated proteins apart. However, the situation here is complicated by the presence of a large amount of fragmentary sequence data of varying lengths. This was dealt with somewhat by working with rather stringent match thresholds and a two-stage process to identify the core sets. We used the concept of <italic>strict</italic>
 long edges and <italic>weak</italic>
 long edges. A strict long edge exists between two vertices (sequences) if their match has the following properties: (1) 90% of the longer sequence is involved in the match; (2) the match has 70% similarity; and (3) the score of the match is at least 60% of the self-score of the longer sequence. A weak long edge exists between two vertices (sequences) if their match has the following properties: (1) 80% of the longer sequence is involved in the match; (2) the match has 40% similarity; and (3) the score of the match is at least 30% of the self-score of the longer sequence. Core set identification had two substages: <italic>large core initialization</italic>
 and <italic>core extension.</italic>
 The large core initialization step identified sets of sequences where these sets were of a reasonable size and the sequences in them were very similar to each other. Furthermore, these sets could be extended in the core extension step by adding related sequences. In the large core initialization step, a directed graph <italic>G</italic>
 was constructed on the sequences using strict long edges, with each long edge being directed from the longer to the shorter sequence. For each vertex <italic>v</italic>
 in <italic>G</italic>
, let <italic>S</italic>
(<italic>v</italic>
) denote the friends set of <italic>v</italic>
 consisting of <italic>v</italic>
 and all neighbors that <italic>v</italic>
 has an out-going edge to.</p>
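<p>As an illustration of the edge definitions above, the following Python sketch classifies a match as a strict or weak long edge. The function name and argument layout are hypothetical. The thresholds are read as inclusive lower bounds, and a match is reported under the most stringent category it satisfies; both readings are assumptions of this sketch.</p>
<preformat>
```python
def long_edge_kind(longer_coverage, similarity, score, longer_self_score):
    """Return 'strict', 'weak', or None for a pairwise match.

    longer_coverage   -- fraction of the longer sequence involved in the match
    similarity        -- similarity of the match (fraction)
    score             -- match score
    longer_self_score -- self score of the longer sequence
    """
    ratio = score / longer_self_score
    if longer_coverage >= 0.90 and similarity >= 0.70 and ratio >= 0.60:
        return "strict"
    if longer_coverage >= 0.80 and similarity >= 0.40 and ratio >= 0.30:
        return "weak"
    return None
```
</preformat>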
<p>Initially all the vertices in <italic>G</italic> are unmarked. Consider the set of all friends sets in decreasing order of their size. For <italic>S</italic>
(<italic>v</italic>
) that is currently being considered, do the following: (1) initialize <italic>seed set A</italic>
 = <italic>S</italic>
(<italic>v</italic>
); (2) while there exists some <italic>v</italic>
′ such that |<italic>S</italic>
(<italic>v</italic>
) ∩ <italic>S</italic>
(<italic>v</italic>
′)| ≥ <italic>k</italic>
, set <italic>A</italic>
 = <italic>A</italic>
 ∪ <italic>S</italic>
(<italic>v</italic>
′). (Note: <italic>k</italic>
 = 10 is chosen); (3) output set <italic>A</italic>
 and mark all vertices in <italic>A</italic>
; and (4) update all friends sets to contain only unmarked vertices.</p>
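<p>The seed-set procedure can be sketched in Python as shown below. This is a simplified reading rather than the implementation used in this study. The overlap test is applied against the friends set currently being considered, as written above (a variant could test against the growing seed set instead). Friends-set sizes are re-evaluated after each update, ties are broken by vertex id, and seeds smaller than a hypothetical min_size parameter (defaulting to k) are left for later steps; these three choices are assumptions.</p>
<preformat>
```python
def large_core_seed_sets(friends, k=10, min_size=None):
    """Return a list of seed sets built from friends sets.

    friends -- dict: vertex -> set containing the vertex and the
               neighbours it has an out-going strict long edge to
    """
    if min_size is None:
        min_size = k  # assumption: smaller sets are handled in later steps
    remaining = {v: set(s) for v, s in friends.items()}
    seeds = []
    while remaining:
        # Largest remaining friends set first.
        v = max(remaining, key=lambda x: (len(remaining[x]), x))
        base = remaining[v]
        if min_size > len(base):
            break
        seed = set(base)
        for other, other_set in remaining.items():
            if other != v and len(base.intersection(other_set)) >= k:
                seed.update(other_set)
        seeds.append(seed)
        # Mark the seed's vertices: remove them from every friends set
        # and drop friends sets that become empty.
        remaining = {u: s - seed for u, s in remaining.items()}
        remaining = {u: s for u, s in remaining.items() if s}
    return seeds
```
</preformat>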
<p>In the core extension step, we constructed a graph <italic>G</italic>
 using weak long edges. All vertices in seed sets (computed from the large core initialization step) were marked and the rest of the vertices unmarked. Each seed set was then greedily extended to be a core set by adding a currently unmarked vertex that has at least <italic>k</italic>
 neighbors (<italic>k</italic>
 = 10 is chosen) in the set; the added vertex was marked. After this process, a clique-finding heuristic was used to identify smaller cliques (of size at most <italic>k</italic>
 − 1) consisting of currently unmarked vertices; these were also extended to become core sets. A final step involved merging the computed core sets on the basis of weak edges connecting them.</p>
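<p>The greedy extension of a seed set can be illustrated with the following Python sketch. It covers only the extension step, not the clique-finding heuristic or the final merge. The function name is hypothetical, weak long edges are treated as undirected adjacency, and candidates are visited in sorted order for determinism; these are assumptions rather than details given in the text.</p>
<preformat>
```python
def extend_seed(seed, neighbors, marked, k=10):
    """Greedily grow a seed set into a core set.

    seed      -- set of vertices in the seed set
    neighbors -- dict: vertex -> set of vertices joined by weak long edges
    marked    -- set of already-marked vertices (updated in place)
    """
    core = set(seed)
    changed = True
    while changed:
        changed = False
        # Candidates are unmarked vertices adjacent to the current core.
        candidates = set()
        for u in core:
            candidates.update(neighbors.get(u, ()))
        for c in sorted(candidates - core - marked):
            if len(core.intersection(neighbors.get(c, ()))) >= k:
                core.add(c)
                marked.add(c)
                changed = True
    return core
```
</preformat>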
<p>In the core set merging step, we constructed an FFAS (Fold and Function Assignment System) profile [<xref ref-type="bibr" rid="pbio-0050016-b039">39</xref>
] for each core set using the longest sequence in the core set as query. FFAS was then used to carry out profile–profile comparisons in order to merge the core sets into larger sets of related sequences. Due to computational constraints imposed by the number of core sets, profiles were built on only core sets containing at least 20 sequences.</p>
<p>Final recruitment involved constructing a PSI-BLAST profile [<xref ref-type="bibr" rid="pbio-0050016-b040">40</xref>
] on core sets of size 20 or more (using the longest sequence in the core set as query) and then using PSI-BLAST (–z 1 × 10<sup>9</sup>
, –e 10) to recruit as yet unclustered sequences or small-sized clusters (size less than 20) to the larger core sets. For a sequence to be recruited, the sequence–profile match had to cover at least 60% of the length of the sequence with an <italic>E-</italic>
value ≤ 1 × 10<sup>−7</sup>
. In a final step, unclustered sequences were recruited to the clusters using their BLAST search results. A length-based threshold was used to determine if the sequence is to be recruited.</p>
</sec>
<sec id="s3g"><title>Identification of clusters containing shadow ORFs.</title>
<p>A well-known problem in predicting coding intervals for DNA sequences is shadow ORFs. The key requirement that coding intervals not contain in-frame stop codons requires that coding intervals be subintervals of ORFs. Long ORFs are therefore obvious candidates to be coding intervals. Unfortunately, the constraints on the coding interval to be an ORF often cause subintervals and overlapping intervals of the coding interval to also be ORFs in one of the five other reading frames (two on the same strand and three on the opposite strand). These coincidental ORFs are called shadow ORFs since they are found in the shadow of the coding ORF. In rare cases (and more frequently in certain viruses) coding intervals in different reading frames can overlap but usually only slightly. Overwhelmingly, distinct coding intervals do not overlap. However, this constraint is not as strict for ORFs that contain a coding interval, as the exact extent of the coding interval is not known. Prokaryotes predominate in these data and are the focus of the ORF predictions. In prokaryotes, the 3′ end of an ORF is very likely to be part of the coding interval because a stop codon is a clear signal for the termination of both the ORF and the coding interval (this signal could be obscured by frameshift errors in sequencing). The 5′ end is more problematic because the true start codon is not so easily identified and so the longest ORF with a reasonable start codon is chosen and this may extend the ORF beyond the true coding interval. For this reason different criteria were set for when ORFs have a significant overlap depending on the orientation (or the 5′ or 3′ ends) of the ORFs involved. Two ORFs on the same strand are considered overlapping if their intervals overlap by at least 100 bp. 
Two ORFs that are on the opposite strands are considered overlapping either if their intervals overlap by at least 50 bp and their 3′ ends are within each other's intervals, or if their intervals overlap by at least 150 bp and the 5′ end of one is in the interval of the other.</p>
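<p>The overlap criteria above can be illustrated with a short Python sketch. The Orf record and function names are hypothetical, and the coordinate convention (0-based, end-exclusive, forward-strand coordinates, with the 5′ and 3′ ends taken as the first and last base of the ORF on its own strand) is an assumption of this sketch.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    start: int   # 0-based, inclusive, forward-strand coordinate
    end: int     # exclusive
    strand: str  # "+" or "-"

    def five_prime(self):
        return self.start if self.strand == "+" else self.end - 1

    def three_prime(self):
        return self.end - 1 if self.strand == "+" else self.start

    def contains(self, pos):
        return pos >= self.start and self.end > pos


def overlap_bp(a, b):
    """Number of base pairs shared by the two intervals."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def significantly_overlap(a, b):
    """Apply the same-strand or opposite-strand overlap rule."""
    n = overlap_bp(a, b)
    if a.strand == b.strand:
        return n >= 100
    tails_nested = a.contains(b.three_prime()) and b.contains(a.three_prime())
    head_nested = a.contains(b.five_prime()) or b.contains(a.five_prime())
    return (n >= 50 and tails_nested) or (n >= 150 and head_nested)
```
</preformat>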
<p>ORFs for coding intervals are clustered based on sequence similarity. In most cases this sequence similarity is due to the ORFs evolving from a common ancestral sequence. Due to functional constraints on the protein being coded for by the ORF, some sequence similarity is retained. There are no known explicit constraints on the shadow ORFs to constrain drift from the ancestral sequences. However, the shadow ORFs still tend to cluster together for some obvious reasons. The drift has not yet obliterated the similarity. There are implicit constraints due to the functional constraints on the overlapping coding ORF. There are also other possible unknown functional constraints beyond the coding ORF. At first it was surmised that within shadow ORF clusters the diversity should be higher than for the coding ORF, but this did not prove to be a reliable signal. The apparent problem is that the shadow ORFs tend to be fractured into more clusters due to the introduction of stop codons that are not constrained because the shadow ORFs are noncoding. What rapidly became apparent is that the most reliable signal that a cluster was made up of shadow ORFs is that the cluster was smaller than the coding cluster containing the ORFs overlapping the shadow ORFs.</p>
<p>The basic rule for labeling a cluster as a shadow ORF cluster is that the size of the shadow ORF cluster is less than the size of another cluster that contained a significant proportion of the overlapping ORFs for the shadow ORF cluster. A specific set of rules was used to label shadow ORF clusters based on comparison to other clusters that contained ORFs overlapping ORFs in the shadow ORF cluster (called the overlapping cluster for this discussion). First, the overlapping cluster cannot be the same cluster as the shadow ORF cluster (there are sometimes overlapping ORFs within the same cluster due to frameshifts). Second, both the redundant and nonredundant sizes of the shadow ORF cluster must be smaller than the corresponding sizes of the overlapping cluster. Third, at least one-third of the shadow ORFs must have overlapping ORFs in the overlapping cluster. Fourth, less than one-half of the shadow ORFs are allowed to contain their overlapping ORFs (this test is rarely needed but did eliminate the vast majority of the very few obvious false positives that were found using these rules). Finally, the majority of the shadow ORFs that overlapped must overlap by more than half their length.</p>
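<p>The five labeling rules can be sketched as a single predicate over summary counts for a candidate shadow ORF cluster and one overlapping cluster. The ClusterPair record, its field names, and the choice of denominators (all ORFs in the candidate cluster for the third and fourth rules, overlapping ORFs for the fifth) are assumptions made for illustration.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass
class ClusterPair:
    shadow_id: str
    other_id: str
    shadow_sizes: tuple  # (redundant size, nonredundant size)
    other_sizes: tuple   # same layout, for the overlapping cluster
    n_shadow_orfs: int   # ORFs in the candidate shadow cluster
    n_overlapping: int   # ... that overlap an ORF in the other cluster
    n_containing: int    # ... that fully contain their overlapping ORF
    n_over_half: int     # overlapping ORFs overlapped over half their length


def is_shadow_cluster(p):
    """Return True if all five rules hold for this pair of clusters."""
    if p.shadow_id == p.other_id:                       # rule 1
        return False
    if not all(o > s for s, o in zip(p.shadow_sizes, p.other_sizes)):
        return False                                    # rule 2
    if p.n_shadow_orfs > 3 * p.n_overlapping:           # rule 3
        return False
    if 2 * p.n_containing >= p.n_shadow_orfs:           # rule 4
        return False
    return 2 * p.n_over_half > p.n_overlapping          # rule 5
```
</preformat>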
<p>When using this rule, 1,274,919 clusters were labeled as shadow ORF clusters, and 6,570,824 singletons were labeled as shadow ORFs. The rules need to be somewhat conservative so as not to eliminate coding clusters. To test these rules, clusters containing at least two NCBI-nr sequences were examined. Two sequences were used instead of one because occasional spurious shadow ORFs have been submitted to NCBI-nr. There were 989 shadow ORF clusters containing at least two NCBI-nr sequences and with more than one-tenth as many NCBI-nr sequences as the overlapping cluster. This was 0.86% of all clusters (114,331 in total) with at least two NCBI-nr sequences. Of these 989, a few were obvious mistakes, and the others involved very few NCBI-nr sequences of dubious curation, such as “hypothetical.” Just to be conservative, all of these 989 clusters were rescued and not labeled as shadow ORF clusters.</p>
</sec>
<sec id="s3h"><title>Ka/Ks test to determine if sequences in a cluster are under selective pressure.</title>
<p>For a cluster containing conserved but noncoding sequences, it is expected that there is no selection at the codon level. We checked this by computing the ratio of nonsynonymous to synonymous substitutions (Ka/Ks test) [<xref ref-type="bibr" rid="pbio-0050016-b130">130</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b131">131</xref>
] on the DNA sequences from which the ORFs in the cluster were derived. For most proteins, Ka/Ks ≪ 1, and for proteins that are under strong positive selection, Ka/Ks ≫ 1. A Ka/Ks value close to 1 is an indication that sequences are under no selective pressure and hence are unlikely to encode proteins [<xref ref-type="bibr" rid="pbio-0050016-b134">134</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b135">135</xref>
]. Weakly selected but legitimate coding sequences can have a Ka/Ks value close to 1. These were identified to some extent by using a model in which different partitions of the codons experience different levels of selective pressure. A cluster was rejected only if no partition was found to be under purifying selection at the amino acid level.</p>
<p>The Ka/Ks test [<xref ref-type="bibr" rid="pbio-0050016-b130">130</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b131">131</xref>
] was run only on those clusters (remaining after the shadow ORF filtering step) that did not contain sequences with HMM matches or have NCBI-nr sequences in them. Only the nonredundant sequences in a cluster were considered. Sequences in each of the clusters were aligned with MUSCLE [<xref ref-type="bibr" rid="pbio-0050016-b134">134</xref>
]. For each cluster, a strongly aligning subset of sequences was selected for the Ka/Ks analysis. The codeml program from PAML [<xref ref-type="bibr" rid="pbio-0050016-b135">135</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b136">136</xref>
] was run using model M0 to calculate an overall (i.e., branch- and position-independent) Ka/Ks value for the cluster. Clusters with Ka/Ks ≤ 0.5, indicating purifying selection and therefore very likely coding, were considered as passing the Ka/Ks filter. In addition, the remaining clusters were examined by running codeml with model M3. This partitioned the positions of the alignment into three classes that may be evolving differently (typically, a few positions may be under positive selection while the remainder of the sequence is conserved). A likelihood ratio test was applied to select clusters for which M3 explained the data significantly better than M0 [<xref ref-type="bibr" rid="pbio-0050016-b136">136</xref>
]. If a cluster was thus selected, and if one of the resulting partitions had a Ka/Ks ≤ 0.5 and comprised at least 10% of the sequence, then that cluster was also considered as passing the Ka/Ks filter. All other clusters were marked as containing spurious ORFs.</p>
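<p>The decision logic of the Ka/Ks filter can be summarized in the following Python sketch. It assumes that the Ka/Ks estimates and log-likelihoods have already been obtained from codeml and parsed elsewhere. The function name and argument layout are hypothetical. The significance level of the likelihood ratio test is not stated above; the sketch assumes a chi-square test with 4 degrees of freedom (three-class M3 versus M0) at the 0.05 level.</p>
<preformat>
```python
# Chi-square critical value for 4 degrees of freedom at alpha = 0.05.
# Both the degrees of freedom and the significance level are assumptions.
LRT_CRITICAL = 9.488


def passes_kaks_filter(omega_m0, lnl_m0, lnl_m3, m3_classes,
                       max_omega=0.5, min_fraction=0.10,
                       lrt_critical=LRT_CRITICAL):
    """Return True if a cluster passes the Ka/Ks filter.

    omega_m0   -- overall Ka/Ks estimated under model M0
    lnl_m0     -- log-likelihood under model M0
    lnl_m3     -- log-likelihood under model M3
    m3_classes -- list of (fraction of sites, Ka/Ks) pairs from model M3
    """
    if max_omega >= omega_m0:
        return True  # purifying selection under the one-ratio model
    lrt_statistic = 2.0 * (lnl_m3 - lnl_m0)
    if lrt_critical >= lrt_statistic:
        return False  # M3 does not fit significantly better than M0
    return any(max_omega >= omega and fraction >= min_fraction
               for fraction, omega in m3_classes)
```
</preformat>
<p>Clusters for which this function returns False correspond to those marked as containing spurious ORFs.</p>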
</sec>
<sec id="s3i"><title>Statistics for the various stages of the clustering process</title>
<p>The number of sequences that remain after redundancy removal (at 98% similarity) for each dataset is given in <xref ref-type="table" rid="pbio-0050016-t012">Table 12</xref>
. Recall that the size of a cluster is the number of nonredundant sequences in it.</p>
<p>Number of core sets of size two or more totals 1,586,454; number of nonredundant sequences in core sets of size two or more totals 8,337,256; and total number of sequences in core sets of size two or more is 12,797,641.</p>
<p>Total number of clusters after profile merging and (PSI-BLAST and BLAST) recruitment is 1,871,434; number of clusters of size two or more totals 1,388,287; number of nonredundant sequences in clusters of size two or more totals 11,494,078; total number of sequences in clusters of size two or more is 16,565,015.</p>
<p>The final clustering statistics (after shadow ORF detection and Ka/Ks tests) are as follows: number of clusters of size two or more totals 297,254; number of nonredundant sequences in clusters of size two or more totals 6,212,610; total number of sequences in clusters of size two or more is 9,978,637.</p>
<p>In the final BLAST recruitment step, a pattern was seen in which highly compositionally biased sequences recruited unrelated sequences to clusters. This was reflected in the pre- and post-BLAST recruitment numbers, where the postrecruitment cluster sizes were more than three to four times the prerecruitment sizes. There were 75 such clusters, and these were removed.</p>
</sec>
<sec id="s3j"><title>Searching sequences using profile HMMs.</title>
<p>The full set of 7,868 Pfam release 17 models was used, along with additional nonredundant profiles from TIGRFAM (1,720 of 2,443 profiles; version 4.1). HMM profiling was carried out using a TimeLogic DeCypher system (Active Motif, Inc., <ext-link ext-link-type="uri" xlink:href="http://www.activemotif.com">http://www.activemotif.com</ext-link>
) and took 327 hours in total (on an eight-card machine). A sequence was considered as matching a Pfam (fragment model) if its sequence score was above the TC score for that Pfam and had an <italic>E-</italic>
value ≤ 1 × 10<sup>−3</sup>
. It was considered as matching a TIGRFAM if the match had an <italic>E-</italic>
value ≤ 1 × 10<sup>−7</sup>
.</p>
</sec>
<sec id="s3k"><title>Evaluation of protein prediction via clustering.</title>
<p>Our evaluation of protein prediction via the clustering shows a very favorable comparison to currently used protein prediction methods for prokaryotic genomes. We used the PG dataset for this evaluation (<xref ref-type="table" rid="pbio-0050016-t002">Table 2</xref>
). Of the 3,049,695 PG ORFs, 575,729 sequences (19%) were clustered (the <italic>clustered set</italic>
). Of the 614,100 predictions made by the genome projects, 600,911 sequences could be mapped to the PG ORF set (the <italic>submitted set</italic>
); 93% of the unmapped sequences were <60 aa (recall that the ORF calling procedure only produced ORFs of length ≥60 aa). The clustered set and submitted set had 493,756 ORFs in common. Of the 107,155 sequences that were only in the submitted set, 24,217 sequences (23%) had HMM matches. As with other unclustered HMM matches, most were weak or partial. These sequences had an average of only 48% of their lengths covered by HMMs. Of the remaining 82,938 sequences that did not have an HMM match, 13,724 (17%) were removed by the filters used, and the rest fell into clusters with only one nonredundant sequence (and thus were not labeled as predicted proteins by the clustering analysis). Based on NCBI-nr sequences in them, these clusters were mostly labeled as “hypothetical,” “unnamed,” or “unknown.” Our clustering method identified 81,973 ORFs not predicted by the genome projects, of which 16,042 (20%) were validated by HMM matches (with average HMM coverage of 69% of sequence length) and an additional 27,120 (33%) had significant BLAST matches (<italic>E-</italic>
value ≤ 1 × 10<sup>−10</sup>
) to sequences in NCBI-nr. Thus, if the submitted set is considered as truth, then protein prediction via clustering produces 493,756 true positives (TP), 81,973 false positives (FP), and 107,155 false negatives (FN), thereby having a sensitivity (TP/[TP + FN]) of 83% and specificity (TP/[TP + FP]) of 86%. However, if truth is considered as those sequences that are common to both the clustered and submitted sets in addition to those sequences with HMM matches, then our protein prediction method via clustering has 95% sensitivity and 89% specificity, while protein prediction by the prokaryotic genome projects has 97% sensitivity and 86% specificity.</p>
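<p>The two evaluations can be recomputed from the counts given above. The sketch below assumes that, in the second evaluation, ORFs supported by HMM matches are counted as true in whichever set they occur; this reading is an assumption, and the variable names are illustrative. With these counts the first sensitivity evaluates to about 0.82 rather than the reported 83%, while the remaining values round to the reported percentages.</p>

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tp, fp):
    # The text defines specificity as TP / (TP + FP).
    return tp / (tp + fp)


# Counts stated in the text.
COMMON = 493_756             # ORFs in both the clustered and submitted sets
CLUSTERED_ONLY = 81_973      # predicted by clustering only
SUBMITTED_ONLY = 107_155     # predicted by the genome projects only
CLUSTERED_ONLY_HMM = 16_042  # clustered-only ORFs with HMM matches
SUBMITTED_ONLY_HMM = 24_217  # submitted-only ORFs with HMM matches

# Evaluation 1: the submitted set is taken as truth.
sens_1 = sensitivity(COMMON, SUBMITTED_ONLY)
spec_1 = specificity(COMMON, CLUSTERED_ONLY)

# Evaluation 2 (assumed reading): truth = common ORFs + HMM-supported ORFs.
tp_clustering = COMMON + CLUSTERED_ONLY_HMM
sens_c = sensitivity(tp_clustering, SUBMITTED_ONLY_HMM)
spec_c = specificity(tp_clustering, CLUSTERED_ONLY - CLUSTERED_ONLY_HMM)

tp_genomes = COMMON + SUBMITTED_ONLY_HMM
sens_g = sensitivity(tp_genomes, CLUSTERED_ONLY_HMM)
spec_g = specificity(tp_genomes, SUBMITTED_ONLY - SUBMITTED_ONLY_HMM)
```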
</sec>
<sec id="s3l"><title>Evaluation of protein clustering.</title>
<p>We used Pfams to evaluate the clustering method in two ways. For both evaluations the clustering was restricted to only those sequences with Pfam matches. It should be kept in mind that there are redundancies among Pfams in that there can be more than one Pfam for a homologous domain family (for instance, the kinase domain Pfams—PF00069 protein kinase domain and PF07714 protein tyrosine kinase), and these redundancies can affect the evaluation statistics reported below.</p>
<p>For the first evaluation, each sequence was represented by the set of Pfams that match it. This is referred to as the <italic>domain architecture</italic>
 for a sequence. While Pfams provide a domain-centric view of proteins, the domain architecture attempts to approximate the full sequence-based approach used here, and thus could be used to shed light on the general performance of the clustering. We measured how often unrelated sequences were present in a given cluster. Two sequences were defined to be unrelated if their domain architectures each had at least one Pfam that was not present in the other's domain architecture. Note that this measure did not penalize the case when the domain architecture of one sequence was a proper subset of the domain architecture of the other sequence. This was done to allow fragmentary sequences in clusters to be included in the evaluation as well (and also because it is not always easy to determine whether an amino acid sequence is fragmentary or not). For each cluster, we computed the percentage of sequence pairs that are unrelated under this measure. A total of 92% of the clusters had at most 2% unrelated pairs. Then we carried out an assessment of how many instances of a given domain architecture appear in a single cluster. A total of 58% of the domain architectures were confined to single clusters (i.e., 100% of their occurrences are in one cluster), and 88% of the domain architectures were such that >50% of their occurrences are in one cluster.</p>
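<p>The unrelatedness measure can be illustrated with a minimal sketch in which each domain architecture is a set of Pfam accessions. The function names are illustrative only.</p>

```python
from itertools import combinations


def are_unrelated(arch_a, arch_b):
    """Two domain architectures (sets of Pfam IDs) are unrelated if each
    contains at least one Pfam that the other lacks."""
    return bool(arch_a - arch_b) and bool(arch_b - arch_a)


def percent_unrelated_pairs(architectures):
    """Percentage of sequence pairs in one cluster that are unrelated."""
    pairs = list(combinations(architectures, 2))
    if not pairs:
        return 0.0
    unrelated = sum(are_unrelated(a, b) for a, b in pairs)
    return 100.0 * unrelated / len(pairs)
```

<p>Because a proper subset has no Pfam that the superset lacks, a fragment carrying only PF00069 is not counted as unrelated to a sequence carrying both PF00069 and PF07714.</p>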
<p>For the second evaluation, we selected all sequences with Pfam matches, and each sequence was assigned to the Pfam that matches it with the highest score. With this assignment, the Pfams induce a partition on the sequences. The distribution of the number of sequences in clusters induced by the Pfams was compared to those of clusters from the clustering method. <xref ref-type="fig" rid="pbio-0050016-g012">Figure 12</xref>
A shows this comparison as a log–log plot of the number of sequences versus the number of clusters with at least that many sequences for the two cases. The plot shows that the cluster size distributions are quite similar, with both methods having an inflection point around 2,500. The difference between the two curves is that there are more big clusters (and also fewer small clusters) induced by the Pfams as compared to the clustering method. This can be explained by noting that two sequences that are in the same Pfam cluster can nevertheless be put into different clusters by the clustering method if they differ in their remaining portions.</p>
<p>Our clustering also shows a good correspondence with HMM profiling on the phylogenetic markers that we looked at. The clustering identifies 7,423, 12,553, and 13,657 sequences, respectively, for RecA (cluster ID 1146), Hsp70 (cluster ID 197), and RpoB (cluster ID 1187). HMM profiling identifies 5,292, 12,298, and 12,165 sequences, respectively, for these families. For each of these families, at least 94% of the sequences (relative to the smaller set) are common to clustering and HMM profiling.</p>
</sec>
<sec id="s3m"><title>Difference in ratio of predicted proteins to total ORFs for the PG set and the GOS set.</title>
<p>The ratio of clustered ORFs to total ORFs is significantly higher for the GOS ORFs (0.3471) compared to the PG ORFs (0.1888). This can be explained by the fragmentary nature of the GOS data. For the large majority of the GOS data, the average sequence length is 920 bp, whereas the PG data consist of full-length genomes. For the PG data, clustered ORFs have a mean length of 325 aa and a median length of 280 aa. Unclustered ORFs have a mean length of 119 aa and a median length of 87 aa. Assuming that the genomic GOS data has a similar underlying ORF structure to PG data, the effect of GOS fragmentation on ORF lengths can be estimated. Each reading frame will have a mixture of clustered and unclustered ORFs, but on average there will be 2 ORFs per reading frame per 920-bp GOS fragment, and both ORFs will be truncated. Assuming the truncation point for the ORF is uniformly distributed across the ORF, the truncated ORF will drop below the 60-aa threshold to be considered as an ORF with a probability of 60/(length of the ORF). Using the median length, the percentage of clustered ORFs dropping below the threshold due to truncation is 21%; for unclustered ORFs, it is 69%. Accounting for this truncation, the expected ratio of clustered ORFs to total ORFs for the GOS ORFs based on the PG ORFs would be 0.3708, which is very close to the observed value.</p>
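<p>The estimate above can be reproduced with a few lines of arithmetic. The way the surviving clustered and unclustered ORFs are combined into a ratio is a reconstruction, not a formula stated in the text, but it yields the reported value of 0.3708 from the PG ratio and the two median lengths. The function names are illustrative.</p>

```python
MIN_ORF_AA = 60  # ORF calling threshold from the text


def truncation_loss(orf_length_aa, min_len=MIN_ORF_AA):
    """Probability that a truncated ORF falls below the length threshold,
    assuming a uniformly distributed truncation point."""
    return min(1.0, min_len / orf_length_aa)


def expected_clustered_ratio(pg_ratio, clustered_median, unclustered_median):
    """Expected clustered/total ORF ratio after fragment truncation."""
    kept_clustered = pg_ratio * (1 - truncation_loss(clustered_median))
    kept_unclustered = (1 - pg_ratio) * (1 - truncation_loss(unclustered_median))
    return kept_clustered / (kept_clustered + kept_unclustered)


ratio = expected_clustered_ratio(0.1888, 280, 87)
```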
</sec>
<sec id="s3n"><title>Kingdom assignment strategy and its evaluation.</title>
<p>We used several approaches to assign kingdoms for GOS sequences. All are based on a strategy that considers the top BLAST matches of a GOS sequence to sequences in NCBI-nr and then takes a majority vote.</p>
<p>We evaluated a simple strict-majority voting scheme (of the top four BLAST matches) using the NCBI-nr set. First, the redundancy in NCBI-nr was removed using a two-staged process. A nonredundant set of NCBI-nr sequences was computed involving matches with 98% similarity over 95% of the length of the shorter sequence (using the procedure discussed in <xref ref-type="sec" rid="s3">Materials and Methods</xref>
 [Identification of nonredundant sequences]). This set was made further nonredundant by considering matches involving 90% similarity over 95% of the length of the shorter sequence. The nonredundant sequences that remained after this step constituted the evaluation dataset <italic>S</italic>
. For each sequence in <italic>S</italic>
, its top four BLAST matches to other sequences in <italic>S</italic>
 (ignoring self-matches) were used to assign a kingdom for it (based on a strict majority rule). This predicted kingdom assignment for the sequence was compared to its actual kingdom. A correct classification is obtained for 93% of the sequences. The correct classification rate per kingdom is given in <xref ref-type="table" rid="pbio-0050016-t013">Table 13</xref>
.</p>
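<p>A minimal sketch of the strict-majority vote follows. It assumes that the kingdoms of the BLAST matches are supplied in rank order and that, when fewer than four matches exist, the majority is taken over the votes actually cast; the latter is an assumption, as the text does not specify this case.</p>

```python
from collections import Counter


def assign_kingdom(top_hit_kingdoms, n_top=4):
    """Strict-majority vote over the kingdoms of the top BLAST matches.

    Returns the winning kingdom, or None when no kingdom holds more than
    half of the votes (for example, a 2-2 split among four matches).
    """
    votes = Counter(top_hit_kingdoms[:n_top])
    if not votes:
        return None
    kingdom, count = votes.most_common(1)[0]
    return kingdom if 2 * count > sum(votes.values()) else None
```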
<p>While this evaluation shows that the BLAST-based voting scheme provides a reasonable handle on the kingdom assignment problem, there are caveats associated with it. The kingdom assignment for a set of query sequences is greatly influenced by the taxonomic groups from each kingdom that are represented in the reference dataset against which these queries are being compared. If certain taxa are only sparsely represented in the reference set, then, depending on their position in the tree of life, queries from these taxa can be misclassified (using a nearest-neighbor type approach based on BLAST matches). This explains why the archaeal classification rate is quite low compared to the others. Thus, the true classification rate for the GOS dataset based on this approach will also depend on the differences in taxonomic biases in the GOS dataset (query) and the NCBI-nr set (reference).</p>
<p>The kingdom proportion for the GOS dataset reported in <xref ref-type="fig" rid="pbio-0050016-g001">Figure 1</xref>
 is based on a kingdom assignment of scaffolds. Those GOS ORFs with BLAST matches to NCBI-nr were considered, and the top-four majority rule was used to assign a kingdom to each of them. Using the ORF coordinates on the scaffold, the fraction (of bp) of a scaffold assigned to each kingdom was computed. The scaffold was labeled as belonging to a kingdom if the fraction of the scaffold assigned to that kingdom was >50%. All ORFs on this scaffold were then assigned to the same kingdom.</p>
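<p>The scaffold-level rule can be sketched as follows. The sketch assumes 1-based inclusive ORF coordinates, ORFs that do not overlap, and a fraction measured against the full scaffold length; none of these details is stated in the text, and the function name is illustrative.</p>

```python
from collections import defaultdict


def assign_scaffold_kingdom(scaffold_length_bp, orfs):
    """Assign a kingdom to a scaffold from its ORF-level assignments.

    orfs -- iterable of (start, end, kingdom); kingdom is None for ORFs
            that received no assignment from the BLAST vote.
    """
    covered_bp = defaultdict(int)
    for start, end, kingdom in orfs:
        if kingdom is not None:
            covered_bp[kingdom] += end - start + 1
    # At most one kingdom can cover more than half of the scaffold.
    for kingdom, bp in covered_bp.items():
        if 2 * bp > scaffold_length_bp:
            return kingdom
    return None
```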
</sec>
<sec id="s3o"><title>Cluster size distribution, the power law, and the rate of protein family discovery.</title>
<p>Earlier studies of protein family sizes in single organisms [<xref ref-type="bibr" rid="pbio-0050016-b137">137</xref>
–<xref ref-type="bibr" rid="pbio-0050016-b139">139</xref>
] have suggested that <italic>P</italic>
(<italic>d</italic>
), the frequency of protein families of size <italic>d,</italic>
 satisfies a power law: that is, <italic>P</italic>
(<italic>d</italic>
) ≈ <italic>d</italic><sup>−β</sup>
 with exponent β reported between 2.68 and 4.02. Power laws have been used to model various biological systems, including protein–protein interaction networks and gene regulatory networks [<xref ref-type="bibr" rid="pbio-0050016-b042">42</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b043">43</xref>
]. <xref ref-type="fig" rid="pbio-0050016-g012">Figure 12</xref>
B illustrates the distribution of the cluster sizes from our data on a log–log scale, a scale for which a power law distribution gives a line. In contrast to family size distributions reported in single organisms, the cluster sizes from our data are not well described by a single power law. Rather, there appear to be different power laws: one governs the size distribution of very large clusters, and another describes the rest. This behavior is observed both in the distribution of the core set sizes and also in the distribution of the final cluster sizes. We identified an inflection point for both the core set distribution and the final clusters at around size 2,500, and estimated the power law exponent β via linear regression separately in each size regime. For the core set distribution, the exponent β = 1.99 (<italic>R</italic>
<sup>2</sup>
 = 0.994) for clusters of size ≤ 2,500, and β = 3.34 (<italic>R</italic>
<sup>2</sup>
 = 0.996) for clusters of size > 2,500. For the final cluster sizes, the exponent β = 1.72 (<italic>R</italic>
<sup>2</sup>
 = 0.995) for clusters of size ≤ 2,500, and β = 2.72 (<italic>R</italic>
<sup>2</sup>
 = 0.995) for clusters of size > 2,500. The estimates for β are different for the core clusters compared to the final clusters, reflecting a larger number of medium and large clusters in the final clustering as a result of the cluster-merging and additional recruitment steps. A similar dichotomy between the size distributions of large and small protein families was observed in a study [<xref ref-type="bibr" rid="pbio-0050016-b140">140</xref>
] of protein families contained in the ProDom, Protomap, and COG databases, where the exponent β reported was in the range of 1.83 to 1.98 for the 50 smallest clusters and 2.54 to 3.27 for the 500 largest clusters in these databases.</p>
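<p>The exponent estimation can be illustrated with an ordinary least-squares fit on log-transformed data, performed separately on either side of the inflection point. This is a sketch under the assumption that the regression is of log frequency on log size; the exact quantities regressed in the analysis are not specified beyond "linear regression", and the function names are illustrative.</p>

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log(count) = log(alpha) - beta * log(size).

    Returns (beta, r_squared).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared


def fit_two_regimes(sizes, counts, inflection=2500):
    """Fit separate exponents at or below, and above, the inflection size."""
    small = [(s, c) for s, c in zip(sizes, counts) if inflection >= s]
    large = [(s, c) for s, c in zip(sizes, counts) if s > inflection]
    return fit_power_law(*zip(*small)), fit_power_law(*zip(*large))
```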
<p>Our clustering method was run separately on the following seven datasets: set 1 consisted of only NCBI-nr sequences; set 2 consisted of all sequences in NCBI-nr, ENS, TGI-EST, and PG; sets 3 through 6 consisted of set 2 in combination with a random subset of 20%, 40%, 60%, and 80% of the GOS sequences, respectively; set 7 consisted of set 2 in combination with all the GOS sequences. On each of the seven datasets, the redundancy removal (using the 98% similarity filter) was run, followed by the core set detection steps. <xref ref-type="fig" rid="pbio-0050016-g002">Figure 2</xref>
 shows the number of core sets of varying sizes (≥3, ≥5, ≥10, and ≥20) as a function of the number of nonredundant sequences for each dataset.</p>
<p>The observed linear growth in number of families with increase in sample size <italic>n</italic>
 is related to the power law distribution in the following way. We model protein families as a graph where each vertex corresponds to a protein sequence and an edge between two vertices indicates sequence similarity between the corresponding proteins. Consider a clustering (partitioning) of the vertices of a graph with <italic>n</italic>
 vertices such that the cluster sizes obey a power law distribution. Let C<italic><sub>d</sub>
</italic>
(<italic>n</italic>
) [respectively, C<sub>≥<italic>d</italic>
</sub>
(<italic>n</italic>
)] denote the number of clusters of size <italic>d</italic>
 (respectively, ≥<italic>d</italic>
). Since the distribution of cluster sizes follows a power law, there exist constants α<italic>,</italic>
 β such that for all <italic>x</italic>
 ≤ <italic>n</italic>
, <italic>C<sub>x</sub>
</italic>
(<italic>n</italic>
) = α<italic>x</italic><sup>−β</sup>
.</p>
<p>As every vertex of the graph is a member of exactly one cluster,
					<disp-formula id="pbio-0050016-e001"><graphic xlink:href="pbio.0050016.e001.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
The number of clusters of size at least <italic>d</italic> is
					<disp-formula id="pbio-0050016-e002"><graphic xlink:href="pbio.0050016.e002.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
Combining the two equations, we obtain values (up to a multiplicative constant) for <italic>C</italic>
<sub>≥<italic>d</italic>
</sub>
(<italic>n</italic>
) as shown in <xref ref-type="table" rid="pbio-0050016-t014">Table 14</xref>
. In all cases with β > 1, the number of clusters <italic>C</italic>
<sub>≥<italic>d</italic>
</sub>
(<italic>n</italic>
) increases as <italic>n</italic>
 increases, and as <italic>d</italic>
 decreases. Specifically, for β > 2, the growth is linear in <italic>n</italic>
 for all <italic>d</italic>
, with slope decreasing as <italic>d</italic>
 increases. For 1 < β < 2, the growth is sublinear in <italic>n</italic>
 for all <italic>d</italic>.
				</p>
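<p>Because the display equations in this subsection are available only as images, the following LaTeX sketch restates the argument. It approximates the sums by integrals, which is an approximation introduced here and not necessarily the form of the original equations, so the constants are indicative only.</p>

```latex
% Every vertex lies in exactly one cluster:
n \;=\; \sum_{x=1}^{n} x\,C_x(n) \;=\; \alpha \sum_{x=1}^{n} x^{1-\beta}

% Number of clusters of size at least d (for n much larger than d, \beta > 1):
C_{\geq d}(n) \;=\; \sum_{x=d}^{n} C_x(n) \;=\; \alpha \sum_{x=d}^{n} x^{-\beta}
  \;\approx\; \frac{\alpha}{\beta-1}\, d^{\,1-\beta}

% Case \beta > 2: the first sum converges, so \alpha grows linearly in n
\alpha \;\approx\; \frac{n}{\zeta(\beta-1)}
  \;\Longrightarrow\;
  C_{\geq d}(n) \;\approx\; \frac{d^{\,1-\beta}}{(\beta-1)\,\zeta(\beta-1)}\; n

% Case \beta between 1 and 2: the first sum grows like n^{2-\beta}/(2-\beta)
\alpha \;\approx\; (2-\beta)\, n^{\beta-1}
  \;\Longrightarrow\;
  C_{\geq d}(n) \;\approx\; \frac{2-\beta}{\beta-1}\; d^{\,1-\beta}\, n^{\beta-1}
```

<p>In both cases the dependence on <italic>d</italic> is through the factor <italic>d</italic><sup>1−β</sup>, and the growth in <italic>n</italic> is linear in the first case and sublinear in the second.</p>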
<p>Note that while the observed distribution of protein family sizes is fit by two different power laws, one for clusters of size less than 2,500 with β = 1.99 and another for clusters of size greater than 2,500 with β = 3.34 for the current number of (nonredundant) sequences, the contribution of large families to the rate of growth is negligible compared to the small families.</p>
<p>The above formulas for <italic>C</italic>
<sub>≥<italic>d</italic>
</sub>
(<italic>n</italic>
) also suggest the dependence of the rate of growth of clusters on the cluster size <italic>d</italic>. For example, in the case when β is very close to 2,
					<disp-formula id="pbio-0050016-e003"><graphic xlink:href="pbio.0050016.e003.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
for some constant <italic>m</italic>
. Thus, the rate of growth of cluster sizes is linear, and the slope <italic>m</italic>
(<italic>d</italic>
) of rate of growth is given by <italic>m</italic>
(<italic>d</italic>
) <italic>= md</italic>
<sup>1−β</sup>
. <xref ref-type="fig" rid="pbio-0050016-g013">Figure 13</xref>
 shows how well the observed rates of growth match the values predicted by this equation. A fit to a sublinear function (not shown) also gives similar results as in <xref ref-type="fig" rid="pbio-0050016-g013">Figure 13</xref>.
				</p>
</sec>
<sec id="s3p"><title>GOS versus known prokaryotic versus known nonprokaryotic.</title>
<p>Examples of top five clusters in the various categories (except GOS-only) are given below. The cluster identifiers are in parentheses.</p>
<p>Known prokaryotic only: (Cluster ID 1319) outer surface protein in <italic>Anaplasma ovis, Wolbachia, Ehrlichia canis;</italic>
 (Cluster ID 10911) nitrite reductase in uncultured bacterium; (Cluster ID 1266) outer membrane lipoprotein in <italic>Borrelia;</italic>
 (Cluster ID 8595) methyl-coenzyme M reductase subunit A in uncultured archaeon; (Cluster ID 2959) outer membrane protein in <italic>Helicobacter.</italic>
 Known nonprokaryotic only: (Cluster ID 2226) Pol polyprotein HIV sequences; (Cluster ID 4023) maturase K; (Cluster ID 6257) NADH dehydrogenase subunit 2; (Cluster ID 8644) HIV protease; (Cluster ID 12196) MHC class I and II antigens. GOS and known prokaryotic only: (Cluster ID 3369) carbamoyl transferase; (Cluster ID 688) apolipoprotein N-acyltransferase; (Cluster ID 3726) potassium uptake proteins; (Cluster ID 300) primosomal protein N′; (Cluster ID 4605) DNA polymerase III delta subunit. GOS and known nonprokaryotic only: (Cluster ID 186) seven transmembrane helix receptors; (Cluster ID 2069) zinc finger proteins; (Cluster ID 3092) MAP kinase; (Cluster ID 1413) potential mitochondrial carrier proteins; (Cluster ID 233) pentatricopeptide (PPR) repeat-containing protein. Known prokaryotic and known nonprokaryotic only: (Cluster ID 3510) immunoglobulin (and immunoglobulin-binding) proteins; (Cluster ID 600) expansin; (Cluster ID 50) pectin methylesterase; (Cluster ID 6492) lectin; (Cluster ID 986) BURP domain-containing protein. GOS and known prokaryotic and known nonprokaryotic: (Cluster ID 2568) ABC transporters; (Cluster ID 49) short-chain dehydrogenases; (Cluster ID 4294) epimerases; (Cluster ID 1239) AMP-binding enzyme; (Cluster ID 2630) envelope glycoprotein.</p>
</sec>
<sec id="s3q"><title>Neighbor functional linkage methods.</title>
<p>For the sequences in each GOS-only cluster, we determined if neighboring ORFs occurring on the same strand had a similar biological process in the GO [<xref ref-type="bibr" rid="pbio-0050016-b049">49</xref>
]. If this shared biological process of the neighbors occurred statistically more often than expected by chance, we inferred a potential operon linkage and a biological process term for the GOS-only cluster. This approach weighted ORFs by sequence similarity to reduce the skewing effect of sequences from highly related organisms.</p>
<p>To define linked ORFs, we collected pairs of same-strand ORF protein predictions with intergenic distances less than 500 bp. Negative distances were possible if the 5′ end of the downstream ORF in the pair occurred 5′ to the 3′ end of the upstream ORF. We used a probability function to estimate the probability that two putative genes belong to the same operon given their intergenic distance [<xref ref-type="bibr" rid="pbio-0050016-b047">47</xref>
]. Because sequences come from a variety of unknown organisms, the probability distribution was created by averaging properties of 33 randomly chosen divergent genomes. The exact choice of genomes did not greatly affect the ability of the distribution to separate experimentally determined same-operon gene pairs from adjacent, same-strand gene pairs in different known operons annotated in a version of RegulonDB downloaded on March 29, 2005 [<xref ref-type="bibr" rid="pbio-0050016-b141">141</xref>
].</p>
<p>We measured the functional linkage between two protein clusters by searching for all occurrences of nearby pairs of ORFs belonging to the two clusters of interest. Sufficiently close pairs were more likely to be encoded in the same operon. We devised a scoring mechanism to reward those pairs of clusters for which many divergent examples of likely operon pairs existed in the set of ORF pairs. For each pair of clusters, a weight was applied to the contribution of each pair of ORFs, and this was proportional to how similar the pair of ORFs was to other example pairs. Thus, many near-identical pairs of ORFs, likely from the same or similar species, are not overrepresented in the final cluster pair score, while conserved examples of neighboring position from more divergent sequences contribute an increased weight. The score for each cluster pair is calculated as:
					<disp-formula id="pbio-0050016-e004"><graphic xlink:href="pbio.0050016.e004.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
where <italic>S</italic>
(C<sub>1</sub>
C<sub>2</sub>
) is the linkage score of clusters C<sub>1</sub>
 and C<sub>2</sub>
. The probability <inline-formula id="pbio-0050016-ex001"><inline-graphic xlink:href="pbio.0050016.ex001.jpg" mimetype="image"></inline-graphic>
</inline-formula>
that any two genes <inline-formula id="pbio-0050016-ex002"><inline-graphic xlink:href="pbio.0050016.ex002.jpg" mimetype="image"></inline-graphic>
</inline-formula>
from C<sub>1</sub>
 and <inline-formula id="pbio-0050016-ex003"><inline-graphic xlink:href="pbio.0050016.ex003.jpg" mimetype="image"></inline-graphic>
</inline-formula>
from C<sub>2</sub>
 are in an operon is dependent on the distance between them as calculated by [<xref ref-type="bibr" rid="pbio-0050016-b047">47</xref>
], and is weighted according to the sequence weights <inline-formula id="pbio-0050016-ex004"><inline-graphic xlink:href="pbio.0050016.ex004.jpg" mimetype="image"></inline-graphic>
</inline-formula>
and <inline-formula id="pbio-0050016-ex005"><inline-graphic xlink:href="pbio.0050016.ex005.jpg" mimetype="image"></inline-graphic>
</inline-formula>
described below, for all example pairs <italic>i</italic>.
				</p>
<p>We calculated sequence weights in a manner similar to that used in progressive multiple sequence alignment [<xref ref-type="bibr" rid="pbio-0050016-b142">142</xref>
]. Briefly, neighbor-joining trees were built for all clusters using the QuickJoin [<xref ref-type="bibr" rid="pbio-0050016-b143">143</xref>
] and QuickTree programs [<xref ref-type="bibr" rid="pbio-0050016-b144">144</xref>
] based on a distance matrix constructed from all-against-all BLAST scores within a cluster, normalized to self-scores. For those few clusters with more than 30,000 members, trees were not built. Instead, equal sequence weights for all members were assigned because of computational limitations. The root of each tree was placed at the midpoint of the tree by using the retree package in PHYLIP [<xref ref-type="bibr" rid="pbio-0050016-b145">145</xref>
]. The individual sequence weights were then computed by summing the distance from each leaf to the root after dividing each branch's weight by the number of nodes in the subtree below it. Weights were normalized so that the sum of weights in any given tree was equal to 1.0. This weighting scheme is superior to one in which weights are normalized to the largest weight in the tree, one that does not weight sequences according to divergence, and one that only considers the number of example pairs seen (<xref ref-type="fig" rid="pbio-0050016-g014">Figure 14</xref>
). To compare the different scoring methods, pairs of clusters annotated with GO terms that contained adjacent ORFs in the data were gathered. These pairs were divided into functionally related and unrelated clusters based on a measure of GO term similarity (<italic>p</italic>
-value ≤ 0.01) [<xref ref-type="bibr" rid="pbio-0050016-b146">146</xref>
]. We evaluated scoring methods for the ability to recover functionally similar pairs. In all analyses, linkages between clusters were ignored if there were fewer than five examples of cluster member ORFs adjacent to each other on a scaffold.</p>
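<p>The tree-based sequence weighting can be sketched as follows. The sketch assumes a rooted tree given as nested (name, branch_length, children) tuples and interprets "the number of nodes in the subtree below" a branch as the number of leaves below it, as is common in progressive-alignment weighting; both the data layout and this interpretation are assumptions, and at least one branch must have a nonzero length.</p>

```python
def leaf_weights(tree):
    """Compute normalized leaf weights from a rooted tree.

    tree -- (name, branch_length, children); children is a list of
            subtrees and is empty for a leaf. The branch length stored
            on the root itself is ignored.
    """
    raw = {}

    def count_leaves(node):
        children = node[2]
        return 1 if not children else sum(count_leaves(c) for c in children)

    def walk(node, inherited):
        name, _, children = node
        if not children:
            raw[name] = inherited
            return
        for child in children:
            # Each branch is shared equally among the leaves below it.
            share = child[1] / count_leaves(child)
            walk(child, inherited + share)

    walk(tree, 0.0)
    total = sum(raw.values())
    # Normalize so that the weights in the tree sum to 1.0.
    return {name: weight / total for name, weight in raw.items()}
```

<p>Recounting leaves at every node keeps the sketch short but is quadratic in the worst case; a production version would cache the leaf counts.</p>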
<p>Function for novel families was inferred as follows. (1) Assignment of GO terms to clusters. We downloaded the GO [<xref ref-type="bibr" rid="pbio-0050016-b049">49</xref>
] database on September 21, 2005, from <ext-link ext-link-type="uri" xlink:href="http://www.geneontology.org">http://www.geneontology.org</ext-link>
, along with the files gene_association.goa_uniprot and pfam2go.txt dated July 12, 2005. Only the biological process component of the ontology was considered. If a cluster had at least 10% of its redundant sequences annotated by the most abundant Pfam domain for that cluster, and that Pfam domain had a GO biological process term provided by the pfam2go mapping, then we assigned the cluster the GO term of its most abundant Pfam annotation. In addition, if at least 20% of a cluster's Uniprot GO annotations were the same term, the cluster was assigned that GO term. For each cluster, redundant GO terms found on the same path to the root were removed. (2) Identification of neighbors to GOS-only clusters. Neighbors of GOS-only clusters were defined as those clusters that had a cluster linkage score above a predetermined threshold (1 × 10<sup>−6</sup>
) and had at least five examples of cluster members adjacent to each other in the data. These neighbors were then screened for those that had been annotated with a GO term by the process described above. (3) Overrepresentation of neighbor GO terms. We attempted to define GO terms for a set of GOS-only neighbors that were statistically overrepresented. Because of the highly dependent nature of the terms in the GO, a simulation-based approach was chosen to determine which terms might be overrepresented. Annotated neighbors to a cluster of unknown function were identified as described above. For each annotated neighbor, counts for the associated GO term and all terms on the path to the root of the ontology were incremented. A total of 100,000 simulated neighbor lists of the same size as the true neighbor list were computed by selecting without replacement from those clusters with annotated GO terms, and an identical counting scheme was performed for each simulation. Overrepresentation of neighbor terms was calculated for each term on the ontology by asking how many times out of the 100,000 simulations the count for each GO term in the ontology met or exceeded the observed count for the actual neighbors. This fraction of simulations was interpreted as a <italic>p-</italic>
value. If a term is unusually prevalent in the true observed neighbors, it should be relatively infrequent in the simulated data. For the purpose of the metric used here, “is-a” and “part-of” relationships were treated equally. In cases where a cluster had more than one GO term assigned to it, any redundant terms occurring on each other's path to the root were first removed. For any remaining clusters with nonredundant, multiple GO annotations, all possible lists of functions for each list of neighbor clusters were enumerated, and one function from each cluster was chosen. Each node in the ontology was assigned the maximum count observed from the enumerated function lists. We consistently applied this rule for the observed and simulated data.</p>
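<p>The simulation-based test in step (3) can be sketched as below. For brevity the sketch assumes one GO term per annotated cluster and a precomputed map from each term to all of its ancestors; the handling of clusters with multiple nonredundant terms described above is omitted. The function names, the seed, and the data layout are illustrative.</p>

```python
import random
from collections import Counter


def term_counts(neighbor_terms, ancestors):
    """Count each neighbor's GO term plus every term on its path to the root."""
    counts = Counter()
    for term in neighbor_terms:
        counts[term] += 1
        for parent in ancestors.get(term, ()):
            counts[parent] += 1
    return counts


def simulated_p_values(observed_terms, annotated_pool, ancestors,
                       n_sim=100_000, seed=0):
    """Fraction of simulations in which a term's count meets or exceeds
    the count observed for the actual neighbors.

    observed_terms -- list of GO terms, one per annotated neighbor
    annotated_pool -- list of GO terms, one per annotated cluster
    ancestors      -- dict mapping a term to the set of all its ancestors
    """
    rng = random.Random(seed)
    observed = term_counts(observed_terms, ancestors)
    hits = Counter()
    for _ in range(n_sim):
        # Draw a neighbor list of the same size, without replacement.
        sample = rng.sample(annotated_pool, len(observed_terms))
        simulated = term_counts(sample, ancestors)
        for term, count in observed.items():
            if simulated[term] >= count:
                hits[term] += 1
    return {term: hits[term] / n_sim for term in observed}
```

<p>A term near the root is reached by almost every sample and therefore receives a fraction close to 1, whereas a specific term that is rare in the pool but frequent among the true neighbors receives a small fraction.</p>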
<p>The following descriptive measures of the novel GOS-only cluster set were obtained. Transmembrane helix prediction was carried out with the programs TMHMM [<xref ref-type="bibr" rid="pbio-0050016-b147">147</xref>
] and SPLIT4 [<xref ref-type="bibr" rid="pbio-0050016-b148">148</xref>
]. GC content was calculated as (G + C)/(G + C + A + T) bases for each ORF in a cluster, and averaged for each cluster within a set. The GC content, reported as the mean and standard deviation of the cluster averages, is as follows for each cluster set: Group I, 36.7% ± 8.0%; Group II, 35.9% ± 7.9%. Group I size-matched sample, 48.8% ± 11.1%; Group II size-matched sample, 49.5% ± 11.2%; Group I viral fraction, 37.8% ± 5.1%; Group II viral fraction, 37.3% ± 4.6%. To address the interconnectivity of the novel clusters within the context of all operon linkages, we constructed a graph with clusters as nodes and inferred operon linkages (with score ≥ 1 × 10<sup>−6</sup>
) as edges. We then asked for every node in the set of novel clusters what was the cumulative fraction of novel nodes that could be reached within a varying edge distance from the starting node. The expectation of this fraction was calculated at each distance, and the procedure was repeated for the set of size-matched clusters (<xref ref-type="fig" rid="pbio-0050016-g015">Figure 15</xref>
).</p>
<p>We tried three different BLAST-based approaches for kingdom assignment of ORFs. The first method, used in the analysis, required a majority of the four top BLAST matches to vote for the same kingdom (archaea, bacteria, eukaryota, or viruses; see <xref ref-type="sec" rid="s3">Materials and Methods</xref>
 [Kingdom assignment strategy and its evaluation]). The second method required all eight top BLAST matches to vote for the same kingdom. The last method we used was the scaffold-based kingdom assignment described in <xref ref-type="sec" rid="s3">Materials and Methods</xref>
 (Kingdom assignment strategy and its evaluation). <xref ref-type="fig" rid="pbio-0050016-g016">Figure 16</xref>
 shows the results of using these assignments to infer the kingdom of GOS-only clusters (<xref ref-type="fig" rid="pbio-0050016-g016">Figure 16</xref>
D–<xref ref-type="fig" rid="pbio-0050016-g016">16</xref>
F) and their neighboring ORFs (<xref ref-type="fig" rid="pbio-0050016-g016">Figure 16</xref>
A–<xref ref-type="fig" rid="pbio-0050016-g016">16</xref>
C). GOS-only clusters were assigned a kingdom only if >50% of their neighboring ORFs were assigned the same kingdom. The general trends observed are the same for each method, though the coverage decreases slightly for the more stringent methods.</p>
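<p>The first voting rule and the neighbor-based cluster assignment can be illustrated with a short sketch. The function names are hypothetical, "majority" is read here as a strict majority of the votes cast, and unassigned neighbors are assumed to count toward the denominator of the 50% rule; the text does not settle either point.</p>

```python
from collections import Counter
from typing import Optional


def majority_kingdom(votes: list[str]) -> Optional[str]:
    """Return the kingdom holding a strict majority of the votes, else None."""
    if not votes:
        return None
    kingdom, count = Counter(votes).most_common(1)[0]
    return kingdom if 2 * count > len(votes) else None


def cluster_kingdom(neighbor_kingdoms: list[Optional[str]]) -> Optional[str]:
    """Assign a kingdom only if more than 50% of all neighboring ORFs agree.

    Neighbors without a kingdom are passed as None and still count toward
    the total (an assumption of this sketch).
    """
    assigned = [k for k in neighbor_kingdoms if k is not None]
    if not assigned:
        return None
    kingdom, count = Counter(assigned).most_common(1)[0]
    return kingdom if 2 * count > len(neighbor_kingdoms) else None
```

<p>Under these assumptions, three of four matching votes assign an ORF, whereas a two-to-two split leaves it unassigned.</p>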
</sec>
<sec id="s3r"><title>Characteristics and kingdom distribution of known protein domains.</title>
<p>For these analyses we used the predicted proteins from the public (NCBI-nr, PG, TGI-EST, and ENS) and GOS datasets. The public dataset contains multiple identical copies of some sequences due to overlaps between the source datasets. For example, many sequences in PG are also found in NCBI-nr. We filtered the public set at 100% identity to avoid overcounting these sequences. Because this filtering was necessary for the public dataset, the GOS dataset was also filtered at 100% identity. If two or more sequences were 100% identical at the residue level, but were of different lengths, only the longest sequence was kept. The resulting datasets of nonredundant proteins are referred to as public-100 and GOS-100.</p>
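<p>A minimal sketch of the 100% identity filter follows. It assumes that "100% identical but of different lengths" means the shorter sequence occurs verbatim within the longer one; the function name is hypothetical, and the quadratic scan is only suitable for small inputs, not for datasets of the size analyzed here.</p>

```python
def filter_100_identity(sequences: dict[str, str]) -> dict[str, str]:
    """Keep one copy of each sequence, dropping any sequence that occurs
    verbatim inside a longer (or equal-length) sequence already kept."""
    kept: dict[str, str] = {}
    # Longest first, so the longest representative is the one retained.
    # Ties are broken by identifier to make the result deterministic.
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        if not any(seq in longer for longer in kept.values()):
            kept[seq_id] = seq
    return kept
```

<p>Applied separately to the public and GOS sets, a filter of this kind would yield nonredundant sets analogous to public-100 and GOS-100.</p>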
<p>We assigned each protein in public-100 to a kingdom based on the species annotations provided in the source datasets (NCBI-nr, Ensembl, TIGR, and PG). The NCBI taxonomy tree was used to determine the kingdom of each species. Of 3,167,979 protein sequences in public-100, 3,158,907 can be annotated by kingdom. The remaining 9,072 sequences are largely synthetic.</p>
<p>Determining the kingdom of origin of an environmental sequence can be difficult; while an unambiguous assignment can be made for some sequences, others can be assigned only tentatively or not at all. Therefore, we took a probabilistic approach (kingdom-weighting method), calculating “weights” or probabilities that each protein sequence originated from a given kingdom.</p>
<p>The top four BLAST matches (<italic>E-</italic>
value &lt; 1 × 10<sup>−10</sup>
) of GOS ORFs to NCBI-nr were considered. The kingdom of origin for each match was determined. We pooled these “kingdom votes” for each scaffold, since (presuming accurate assembly) each scaffold must come from a single species and hence from a single kingdom. Each ORF on a scaffold contributed up to four votes. If an ORF had fewer than four BLAST matches with an <italic>E-</italic>
value &lt; 1 × 10<sup>−10</sup>
, then it contributed fewer votes. ORFs with no BLAST matches contributed no votes.</p>
<p>In many cases, the votes were not unanimous, indicating that some uncertainty must be associated with any kingdom assignment. An additional source of uncertainty is the finite number of votes. We accounted for these statistical issues by applying the following procedure to each scaffold. First, two pseudocounts were added to the votes for the “unknown” kingdom to represent the uncertainty that remains even when votes are unanimous (especially when there are few votes). The frequency of votes for each kingdom was calculated. The vote frequency for a kingdom provides the maximum likelihood estimate of the kingdom probability (i.e., the vote frequency that would have been observed on a scaffold of similar composition but with infinitely many voting ORFs). However, that estimate may not be accurate or precise. Therefore, the multinomial standard deviation was calculated for each vote frequency <italic>p</italic>
 as SQRT [<italic>p</italic>
 × (1 − <italic>p</italic>
)/(<italic>n</italic>
 − 1)], where <italic>n</italic>
 is the number of votes. A distance of two standard deviations from the mean corresponds roughly to a 95% confidence interval. Thus, two standard deviations were subtracted from each vote frequency, and the result (or zero, if the result was negative) was called the “kingdom weight.” This “kingdom weight” is a conservative estimate: there is roughly a 95% chance that the actual kingdom probability is greater.</p>
<p>The kingdom weights do not sum to one because of the standard deviation penalty. The difference between the sum of the kingdom weights and unity is a measure of the total uncertainty about the kingdom assignment. This is called the “unknown weight.”</p>
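<p>The weighting procedure can be summarized in a short sketch. The function name and the dictionary layout are hypothetical, and the sketch assumes that the only votes for the "unknown" kingdom are the two pseudocounts.</p>

```python
import math


def kingdom_weights(votes: dict[str, int], pseudocounts: int = 2) -> dict[str, float]:
    """Conservative kingdom weights for one scaffold.

    `votes` maps each known kingdom to its pooled BLAST votes. The
    pseudocounts are credited to the "unknown" kingdom, so `votes` is
    assumed not to contain an "unknown" key of its own.
    """
    n = sum(votes.values()) + pseudocounts
    weights: dict[str, float] = {}
    for kingdom, count in votes.items():
        p = count / n  # maximum likelihood estimate of the kingdom probability
        sd = math.sqrt(p * (1.0 - p) / (n - 1))
        weights[kingdom] = max(0.0, p - 2.0 * sd)
    # Whatever is not claimed by a known kingdom is the "unknown weight".
    weights["unknown"] = 1.0 - sum(weights.values())
    return weights
```

<p>For a scaffold with eight unanimous bacterial votes, this sketch gives a bacterial weight of about 0.53 and an unknown weight of about 0.47.</p>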
<p>Finally, we assigned each ORF the kingdom weights calculated for the scaffold as a whole. This procedure assigned kingdom weights to many ORFs with no BLAST matches. Overall, 4,745,649 (84%) of the 5,654,638 proteins in GOS-100 receive nonzero kingdom weights.</p>
<p>The kingdom weights calculated in this way provide a basis for estimating the proportion of sequences originating from each kingdom, <italic>p</italic>
<sub><sc>GOS</sc>
</sub>
(<italic>K</italic>
). The weights over all sequences in GOS-100 were summed for each of the known kingdoms, and divided by the sum of the weights for all kingdoms (excluding the unknown weight). This procedure suggested that 96% of the sequences are bacterial, a somewhat higher proportion than is estimated by the method described in <xref ref-type="sec" rid="s3">Materials and Methods</xref>
 (Kingdom assignment strategy and its evaluation). Similarly, kingdom proportions, <italic>p</italic>
<sub><sc>GOS</sc>
–Pfam</sub>
(<italic>K</italic>
), were calculated for the subset of GOS-100 sequences that have a significant Pfam hit, and 97% are found to be bacterial.</p>
<p>We used the kingdom weights directly in the analyses where possible (e.g., to calculate the expected kingdom distribution of a given set of proteins by summing the weights). However, it was necessary in some cases to use discrete assignments of a single kingdom to each ORF. A tentative assignment can be made for a given scaffold by choosing the kingdom with the highest weight. The possibility remains, in this case, that a fraction of the “unknown” weight should rightfully belong to a different kingdom. However, if a kingdom weight is greater than 0.5, then this danger is averted, and a “confident” assignment of the scaffold and its constituent ORFs to that kingdom can be made.</p>
<p>Given the uncertainty penalty above, achieving a kingdom weight greater than 0.5 generally requires overwhelming support for one kingdom over the others. In particular, on a given scaffold, at least eight unanimous votes for a kingdom are needed (i.e., two ORFs contributing four votes each) to make a confident assignment to that kingdom. Any disagreement between the votes increases the required number rapidly: for instance, 15 votes for a single kingdom are required to override four votes for other kingdoms.</p>
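<p>The two thresholds quoted above follow directly from the weight formula. In the worked arithmetic below, the symbols v (votes for the kingdom in question) and v_other (votes for other kingdoms) are notation introduced here for illustration; the constant 2 in n is the pair of pseudocounts for the "unknown" kingdom.</p>

```latex
\begin{gather*}
w = \max\!\left(0,\; p - 2\sqrt{\frac{p\,(1-p)}{n-1}}\right), \qquad p = \frac{v}{n}, \qquad n = v + v_{\text{other}} + 2 \\
v = 8,\ v_{\text{other}} = 0:\quad n = 10,\ p = 0.800,\ w \approx 0.800 - 2(0.133) = 0.533 \\
v = 7,\ v_{\text{other}} = 0:\quad n = 9,\ p \approx 0.778,\ w \approx 0.778 - 2(0.147) = 0.484 \\
v = 15,\ v_{\text{other}} = 4:\quad n = 21,\ p \approx 0.714,\ w \approx 0.714 - 2(0.101) = 0.512 \\
v = 14,\ v_{\text{other}} = 4:\quad n = 20,\ p = 0.700,\ w \approx 0.700 - 2(0.105) = 0.490
\end{gather*}
```

<p>Seven unanimous votes, or 14 votes against four dissenting ones, therefore fall just short of the 0.5 cutoff, whereas eight and 15 votes, respectively, clear it.</p>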
<p>“Confident” kingdom assignments were made for 2,626,178 (46%) of the 5,654,638 proteins in GOS-100.</p>
<p>In the analysis that identified new multi-kingdom Pfams, we used the subset of confidently kingdom-annotated proteins. Here, a Pfam model was designated as “kingdom-specific” in public-100 if there were only matches to proteins in one particular kingdom, and no “unknown” matches. A Pfam model that was kingdom specific in public-100 was further designated as newly “multi-kingdom” if it had matches to one or more GOS-100 proteins that were confidently labeled as belonging to a kingdom different from that found in the public-100 matches. Also, we filtered Pfam matches with an <italic>E-</italic>
value cutoff of 1 × 10<sup>−10</sup>
. In every case, the bit score is at least five bits greater than the trusted cutoff for the model. In addition to passing the “confident” criteria, the kingdom assignments were all confirmed by visual inspection of the BLAST kingdom vote distributions for the respective scaffolds. Because the criteria for a “confident” kingdom assignment were conservative, there were only one or a few confident assignments for each domain to a “new” kingdom. The “confident” criteria are especially difficult to meet in the case of kingdom-crossing due to the votes contributed by the crossing protein. For instance, because the IDO domain itself always contributes four votes for “Eukaryota,” at least 15 votes for “Bacteria” were required to call a scaffold “bacterial.” Thus, many scaffolds have no confident kingdom assignment.</p>
<p>We compared the relative diversities of protein families between GOS-100 and public-100 as represented by Pfam sequence models. In order to do this, the number of matches expected to be found for each Pfam model in the GOS-100 data was computed, assuming that the matches were distributed among the models in the same proportions that they were in the public-100 data. These “expected” match counts were compared with the observed counts to identify domains that are more diverse in GOS-100 than in public-100 and vice versa.</p>
<p>Because kingdoms differ in their protein usage, Pfam models match sequences from different kingdoms with different frequencies, and some models match sequences exclusively from one kingdom. Thus, to calculate the expected number of matches to a given Pfam in GOS-100 based on the number of matches observed in public-100, we corrected for the radically different kingdom composition of the two datasets.</p>
<p>The expected proportion of all Pfam matches in GOS-100 that are to a given model <italic>M</italic>
 was calculated as follows. First, we made a simplifying assumption that sequences from different kingdoms were equally likely to have a Pfam hit, and thus that the Pfam matches in GOS-100 would be distributed among the kingdoms according to the kingdom proportions calculated using the weighted method above (for instance, it is assumed that 97% of the matches would be to bacterial sequences). Probability that a Pfam hit in GOS-100 is from <italic>K</italic>
 ≈ <italic>p</italic>
<sub>GOS-Pfam</sub>
(<italic>K</italic>
) (for sequences in GOS-100 with at least one Pfam hit) for kingdoms <italic>K</italic>
 in {Archaea, Bacteria, Eukaryotes, Viruses}.</p>
<p>Second, we assumed that Pfam models match with the same relative rates within each kingdom in GOS-100 as they do in public-100. For instance, since twice as many SH3 domains as SH2 domains are found in public-100 eukaryotic sequences, the same ratio is expected to be found in GOS-100 eukaryotic sequences. Using the public-100 data, we calculated the frequency of matches for each Pfam model <italic>M</italic>
 within each kingdom, relative to the total number of Pfam matches to that kingdom. Pseudocounts of one were added to both the “match” and “no match” counts (i.e., using a uniform Dirichlet prior), to allow proper statistical treatment of families with few or no matches in the public databases for some kingdom. In Equation 5 below, Obs<sub>public</sub>
(<italic>M,K</italic>
) is the observed number of public-100 hits to <italic>M</italic>
 in <italic>K,</italic>
 and Obs<sub>public</sub>
(<italic>K</italic>
) is the observed number of public-100 hits to all models in <italic>K</italic>.
					<disp-formula id="pbio-0050016-e005"><graphic xlink:href="pbio.0050016.e005.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
</p>
<p>By multiplying the conditional probability of each model given a kingdom by the respective kingdom probability (<italic>p</italic>
<sub>GOS-Pfam</sub>
(<italic>K</italic>), calculated as described above in “Kingdom annotation of GOS-100 proteins: kingdom weighting method”), the proportions of Pfam matches in GOS-100 due to each combination of kingdom and Pfam model were then predicted. Finally, these predictions were summed across kingdoms to obtain the expected proportion of matches to each model.
					<disp-formula id="pbio-0050016-e006"><graphic xlink:href="pbio.0050016.e006.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
</p>
<p>Relatively fewer GOS-100 sequences than public-100 sequences have a Pfam hit (likely because Pfam is based on sequences in the public databases). To avoid systematically overestimating the number of GOS-100 hits for each Pfam model due to this global effect, the predicted counts were based on the observed total number of Pfam matches to all models in GOS-100, and an attempt was made to predict only how these matches are distributed among models. Thus, the expected number of Pfam hits to a given model in GOS-100 is equal to the expected proportion of hits to that model, as calculated above, multiplied by the total number of Pfam hits. In the equation below, Obs<sub>GOS</sub> is the total number of Pfam hits to all models in GOS-100.
					<disp-formula id="pbio-0050016-e007"><graphic xlink:href="pbio.0050016.e007.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
</p>
<p>In summary, calculation of the expected number of Pfam hits to a model <italic>M</italic> in GOS-100 for all kingdoms can be expressed in one equation as follows:
					<disp-formula id="pbio-0050016-e008"><graphic xlink:href="pbio.0050016.e008.jpg" position="anchor" mimetype="image"></graphic>
</disp-formula>
where Obs<sub>public</sub>
(<italic>M,K</italic>
) is the observed number of public-100 hits to model <italic>M</italic>
 in <italic>K,</italic>
 Obs<sub>public</sub>
(<italic>K</italic>
) is the observed number of public-100 hits to all models in <italic>K</italic>
, <italic>p</italic>
<sub>GOS-Pfam</sub>
(<italic>K</italic>
) is the estimated proportion of kingdom <italic>K</italic> among the GOS-100 sequences that have at least one Pfam hit
, and Obs<sub>GOS</sub> is the total number of Pfam hits to all models in GOS-100.
				</p>
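<p>The calculation of expected hit counts can be sketched as follows. Because the equations are available here only as images, the conditional probability of a model given a kingdom is assumed to take the add-one form (Obs(M,K) + 1)/(Obs(K) + 2) implied by the pseudocount description; the function and argument names are hypothetical.</p>

```python
def expected_gos_hits(
    obs_public: dict[str, dict[str, int]],
    p_gos_pfam: dict[str, float],
    obs_gos_total: int,
) -> dict[str, float]:
    """Expected GOS-100 hit count per Pfam model, corrected for kingdom mix.

    obs_public[K][M] is the number of public-100 hits to model M in kingdom K,
    p_gos_pfam[K] is the kingdom proportion among GOS-100 sequences with a
    Pfam hit, and obs_gos_total is the total number of GOS-100 Pfam hits.
    """
    models = sorted({m for hits in obs_public.values() for m in hits})
    expected: dict[str, float] = {}
    for model in models:
        proportion = 0.0
        for kingdom, p_kingdom in p_gos_pfam.items():
            hits = obs_public.get(kingdom, {})
            total = sum(hits.values())
            # Assumed reading of the pseudocount rule: add one to both the
            # "match" and the "no match" counts.
            p_model_given_kingdom = (hits.get(model, 0) + 1) / (total + 2)
            proportion += p_kingdom * p_model_given_kingdom
        expected[model] = obs_gos_total * proportion
    return expected
```

<p>With two models, two kingdoms, and kingdom proportions of 0.75 and 0.25, the sketch distributes the observed GOS-100 hits according to the kingdom-weighted public-100 frequencies.</p>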
<p>The ratio of the observed to the predicted number of hits for each Pfam model is a measure of the relative diversity of that Pfam family in GOS-100 compared to public-100, corrected for the differing kingdom proportions in the two datasets. We computed the significance of this ratio using the CHITEST function in Excel, which implements the standard Pearson's Chi-square test with one degree of freedom and expresses the result as a probability. For many protein families, the difference in diversity between the two datasets was so pronounced that Excel reports a probability of zero due to numerical underflow, indicating a <italic>p</italic>
-value less than 1 × 10<sup>−303</sup>
.</p>
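<p>For readers without Excel, the test can be approximated with the Python standard library. The two-cell layout below (hits to the model of interest versus hits to all other models) is an assumption about how the one-degree-of-freedom test was arranged, and the function names are hypothetical. The p-value uses the identity that a chi-square variable with one degree of freedom exceeds x with probability erfc(sqrt(x/2)).</p>

```python
import math


def pearson_chi2(observed: float, expected: float, total: float) -> float:
    """Pearson statistic for a two-cell table: hits to the model of interest
    versus hits to all other models, with the grand total held fixed."""
    diff = observed - expected
    return diff * diff / expected + diff * diff / (total - expected)


def chi2_p_value_1df(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with one degree of
    freedom, computed as erfc(sqrt(statistic / 2))."""
    return math.erfc(math.sqrt(statistic / 2.0))
```

<p>As in Excel, very large statistics underflow to a reported probability of exactly zero in double-precision arithmetic.</p>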
</sec>
<sec id="s3s"><title>IDO analysis.</title>
<p>The GOS-100 and public-100 sequences selected for the IDO family alignment matched the PF01231 Pfam fs model with a score above the trusted bit-score cutoff at the sequence level. In addition, the matching region of each sequence was required to span more than 50% of the length of the Pfam IDO HMM. Next, all sequence matches to the Pfam IDO model from the NCBI-nr database downloaded on March 6, 2006, were added (these also satisfied the trusted score cutoff and model alignment span criteria). This newer database contributed an additional 26 IDO sequences relative to the GOS public sequence data freeze, after filtering out sequences that were identical or differed by 1 aa and requiring the presence of the first and last residues in the final trimmed alignment. Jevtrace (version 3.14) [<xref ref-type="bibr" rid="pbio-0050016-b149">149</xref>
] was used to assess alignment quality, to remove sequences problematic for alignment, to remove sequence redundancy (at the 0-aa and 1-aa difference levels) while allowing for redundant nonoverlapping sequences, to trim the alignment to a block of aligned columns, to delete columns with more than 50% gaps, and to remove sequences with missing first or last residues. One sequence (GenBank ID 72038700) was likely a multidomain protein problematic for alignment and was removed manually. This set of procedures produced a block sequence alignment of 144 sequences and 231 characters. We aligned sequences with MUSCLE (version 3.52) [<xref ref-type="bibr" rid="pbio-0050016-b134">134</xref>
] using default parameters. The final alignment was used to reconstruct phylogenies with a series of phylogeny reconstruction methods: PHYML [<xref ref-type="bibr" rid="pbio-0050016-b150">150</xref>
], Tree-Puzzle [<xref ref-type="bibr" rid="pbio-0050016-b151">151</xref>
], Weighbor [<xref ref-type="bibr" rid="pbio-0050016-b152">152</xref>
], and the protpars program from the PHYLIP package (version 3.6a3) [<xref ref-type="bibr" rid="pbio-0050016-b145">145</xref>
]. Bootstrapping was performed with the protpars program using 1,000 bootstrap replicates, each with 100 jumbles; the majority consensus tree was produced by the consense program in the PHYLIP package.</p>
</sec>
<sec id="s3t"><title>Structural genomics implications.</title>
<p>The Pfam5000 families used in this study were chosen from among the manually curated (Pfam-A) families in Pfam version 17. We added 2,932 families with a structurally characterized representative as of October 27, 2005, to the Pfam5000 in descending order by family size, followed by 2,068 additional families without a structurally characterized representative, in descending order by family size. Pre-GOS family size was calculated as the number of sequences in public-100 that had a match to the Pfam family. Post-GOS family size was calculated as the number of sequences in public-100 and GOS-100 that matched each family. For this analysis, we used the results of the Pfam HMM profiling effort.</p>
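<p>The ordering rule for the Pfam5000 can be expressed compactly. The sketch below is illustrative only: the Family record, its field names, and the accessions used in any example are hypothetical.</p>

```python
from typing import NamedTuple


class Family(NamedTuple):
    accession: str
    size: int            # number of matching sequences
    has_structure: bool  # a structurally characterized representative exists


def rank_families(families: list[Family], limit: int = 5000) -> list[Family]:
    """Order families as described in the text: structurally characterized
    families first, then the rest, each group by descending family size."""
    solved = sorted((f for f in families if f.has_structure), key=lambda f: -f.size)
    unsolved = sorted((f for f in families if not f.has_structure), key=lambda f: -f.size)
    return (solved + unsolved)[:limit]
```

<p>Running the same ranking with pre-GOS and post-GOS family sizes would give the two versions of the list that are compared in the coverage analysis.</p>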
<p>Coverage of GOS-100 and public-100 sequences by both versions of the Pfam5000 was measured using the subset of families in Pfam 17 that were also in Pfam 16. This was done in order to enable direct comparison of coverage results with a previous study of coverage of fully sequenced bacterial and eukaryotic genomes [<xref ref-type="bibr" rid="pbio-0050016-b073">73</xref>
]. The versions of Pfam are similar in size (Pfam 16 contains 7,677 families, and Pfam 17 contains 7,868 families).</p>
</sec>
<sec id="s3u"><title>Phylogeny construction for various families.</title>
<p>For the UVDE family, sequences were aligned using MUSCLE [<xref ref-type="bibr" rid="pbio-0050016-b134">134</xref>
] and a tree was built using QuickTree [<xref ref-type="bibr" rid="pbio-0050016-b144">144</xref>
].</p>
<p>For the PP2C family, the catalytic domain portions of the sequences were identified and aligned using the PP2C Pfam model. Sequences that contained ≥70% nongaps in this alignment were used to generate a phylogenetic tree of all the PP2C-like sequences. The phylogeny was inferred using the protdist and neighbor-joining programs in PHYLIP [<xref ref-type="bibr" rid="pbio-0050016-b145">145</xref>
]. We used 1,941 total PP2C-like sequences for the phylogenetic analysis. The breakdown was as follows: public eukaryotic sequences, 73%; public bacterial sequences, 14%; GOS-eukaryotic sequences, 2%; GOS-bacterial sequences, 10%; and GOS-viral and GOS-unknown sequences, less than 1% combined.</p>
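<p>The nongap filter applied to the PP2C alignment can be sketched as below. The function names are hypothetical, and both "-" and "." are assumed to denote gaps, as is common in HMM-derived alignments.</p>

```python
def nongap_fraction(aligned: str, gap_chars: str = "-.") -> float:
    """Fraction of alignment columns in which this row has a residue."""
    if not aligned:
        return 0.0
    nongaps = sum(1 for ch in aligned if ch not in gap_chars)
    return nongaps / len(aligned)


def filter_alignment(rows: dict[str, str], min_fraction: float = 0.70) -> dict[str, str]:
    """Keep only the aligned rows with at least `min_fraction` nongap columns."""
    return {
        name: row
        for name, row in rows.items()
        if nongap_fraction(row) >= min_fraction
    }
```

<p>Rows that pass the filter would then be passed on to distance calculation and tree building.</p>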
<p>For the type II GS family, sequences in GOS and NCBI-nr were searched with a type II GS HMM constructed from 17 previously known bacterial and eukaryotic type II GS sequences. Matching sequences from NCBI-nr and GOS were filtered separately for redundancy at 98% identity; the combined set of sequences was aligned and a neighbor-joining tree was constructed.</p>
<p>For the RuBisCO family, matching RuBisCO sequences from GOS and NCBI-nr were filtered separately for redundancy at 90% identity, resulting in 724 sequences in total. The 724 RuBisCO sequences were then aligned and a neighbor-joining tree was constructed.</p>
</sec>
<sec id="s3v"><title>Identification of proteases.</title>
<p>We clustered sequences in the MEROPS Peptidase Database [<xref ref-type="bibr" rid="pbio-0050016-b100">100</xref>
] using CD-HIT [<xref ref-type="bibr" rid="pbio-0050016-b116">116</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b117">117</xref>
] at the 40% similarity level. This resulted in 7,081 sequences, which were then divided into groups based on catalytic type and clan identifier. These sequences were used as queries to search against a clustered version of NCBI-nr (clustered at a 60% similarity threshold) using BLASTP (<italic>E-</italic>
value ≤ 1 × 10<sup>−10</sup>
). A similar search was carried out against GOS (clustered at a 60% similarity threshold). <xref ref-type="fig" rid="pbio-0050016-g017">Figure 17</xref>
 shows the content of protease types in NCBI-nr and GOS together with the kingdom distributions. <xref ref-type="fig" rid="pbio-0050016-g018">Figure 18</xref>
 shows the content of bacterial protease clans.</p>
</sec>
<sec id="s3w"><title>Metabolic enzymes in GOS.</title>
<p>Hmmsearch from the HMMER package [<xref ref-type="bibr" rid="pbio-0050016-b105">105</xref>
] was used to search the GOS sequences for different GS types. The GlnA TIGRFAM model was used for finding GSI sequences. The HMMs built from known examples of 17 GSII and 18 GSIII sequences from NCBI-nr were used to search the GOS sequences.</p>
</sec>
<sec id="s3x"><title>Identification of ORFans in NCBI-nr.</title>
<p>ORFans are proteins that do not have any recognizable homologs in known protein databases. A straightforward way to identify ORFans is through all-against-all sequence comparison using relaxed match parameters. However, this is not computationally practical. An effective approach is to first remove the non-ORFans that can be easily found, and then to identify ORFans from the remaining sequences.</p>
<p>We identified non-ORFans by clustering the NCBI-nr with CD-HIT [<xref ref-type="bibr" rid="pbio-0050016-b116">116</xref>
,<xref ref-type="bibr" rid="pbio-0050016-b117">117</xref>
], an ultrafast sequence clustering program. A multistep iterated clustering was performed with a series of decreasing similarity thresholds. NCBI-nr was first clustered to NCBI-nr90, in which sequences with >90% similarity were grouped. NCBI-nr90 was then clustered successively to NCBI-nr80/70/60/50 and finally to NCBI-nr30. After each clustering stage, the total number of clusters decreased and additional non-ORFans were identified. A one-step clustering from NCBI-nr directly to NCBI-nr30 is possible, but the multistep clustering is computationally more efficient.</p>
<p>At the 30% similarity level, all the NCBI-nr proteins were grouped into 391,833 clusters, including 259,571 singleton clusters. The proteins in nonsingleton clusters are by definition non-ORFans. However, proteins that remain as singletons are not necessarily ORFans, because their similarity to other proteins may not be reported for two reasons: (1) significant sequence similarity can be below 30%; and (2) in order to prevent a cluster from being too diverse, CD-HIT, like all other clustering algorithms, may not add a sequence to that cluster even if the similarity between this sequence and a sequence in that cluster meets the similarity threshold.</p>
<p>The 259,571 singletons were compared to NCBI-nr with BLASTP [<xref ref-type="bibr" rid="pbio-0050016-b038">38</xref>
] to identify the real ORFans among them. The default low-complexity filter was enabled in the BLAST comparisons, and the <italic>E-</italic>
value threshold was set to 1 × 10<sup>−6</sup>
. In the end, 84,911 proteins with at least 100 aa were identified as ORFans. About 100,000 short ORFan candidates of fewer than 100 aa were excluded from this study because they may not be real proteins.</p>
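<p>The final filtering step can be sketched as follows, assuming the BLASTP results have already been parsed into per-query lists of (subject, E-value) pairs. The function name and data layout are hypothetical, self-matches are assumed to be ignored, and the threshold is treated as inclusive; the text specifies neither detail.</p>

```python
def find_orfans(
    singleton_lengths: dict[str, int],
    blast_hits: dict[str, list[tuple[str, float]]],
    max_evalue: float = 1e-6,
    min_length: int = 100,
) -> set[str]:
    """Return singletons of at least `min_length` aa that have no BLAST match
    to another sequence at or below the E-value threshold."""
    orfans: set[str] = set()
    for query, length in singleton_lengths.items():
        if not length >= min_length:
            continue  # short candidates are excluded from the study
        has_homolog = any(
            subject != query and max_evalue >= evalue
            for subject, evalue in blast_hits.get(query, [])
        )
        if not has_homolog:
            orfans.add(query)
    return orfans
```

<p>Singletons with any qualifying match to another sequence are reclassified as non-ORFans, which is why the final count is smaller than the number of singleton clusters.</p>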
</sec>
<sec id="s3y"><title>Genome sequencing projects and rate of discovery.</title>
<p>We used Ensembl sequences for <italic>Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris,</italic>
 and <italic>Pan troglodytes.</italic>
 Their clustering information is shown in <xref ref-type="table" rid="pbio-0050016-t015">Table 15</xref>
. When we considered the datasets in the order HS, HS + MM, HS + MM + RN, HS + MM + RN + CF, and HS + MM + RN + CF + PT, the numbers of distinct clusters were 10,536, 12,731, 13,605, 14,606, and 14,993, respectively. These numbers were compared against a random subset of NCBI-nr bacterial sequences (of a similar size) and also against a random subset of GOS sequences. We also randomized the order of the mammalian sequences to produce a dataset that was independent of the genome order being considered.</p>
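<p>The cumulative cluster counts and the order-randomized control can be illustrated with a small sketch. The function names, the integer cluster identifiers, and the fixed step size are hypothetical choices made for this example.</p>

```python
import random


def cumulative_distinct(datasets: list[tuple[str, set[int]]]) -> list[int]:
    """Number of distinct clusters seen after adding each dataset in order."""
    seen: set[int] = set()
    counts: list[int] = []
    for _name, cluster_ids in datasets:
        seen |= cluster_ids
        counts.append(len(seen))
    return counts


def cumulative_distinct_shuffled(
    sequence_clusters: list[int], step: int, seed: int = 0
) -> list[int]:
    """Same curve after randomizing the sequence order, sampled every `step`
    sequences, so that it no longer depends on the order of the genomes.

    `sequence_clusters` holds one cluster identifier per sequence.
    """
    rng = random.Random(seed)
    shuffled = sequence_clusters[:]
    rng.shuffle(shuffled)
    seen: set[int] = set()
    counts: list[int] = []
    for start in range(0, len(shuffled), step):
        seen.update(shuffled[start:start + step])
        counts.append(len(seen))
    return counts
```

<p>Plotting either list against the number of sequences added gives a discovery curve of the kind compared across the mammalian, bacterial, and GOS subsets.</p>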
</sec>
</sec>
<sec sec-type="supplementary-material" id="s4"><title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pbio-0050016-sd001"><label>Protocol S1</label>
<caption><title>Supplementary Information</title>
<p>(25 KB DOC)</p>
</caption>
<media xlink:href="pbio.0050016.sd001.doc" mimetype="application" mime-subtype="msword">
</media>
</supplementary-material>
<sec id="s4a"><title>Accession Numbers</title>
<p>All NCBI-nr sequences from February 10, 2005 were used in our analysis. <xref ref-type="supplementary-material" rid="pbio-0050016-sd001">Protocol S1</xref>
 lists the GenBank (<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/Genbank">http://www.ncbi.nlm.nih.gov/Genbank</ext-link>
) accession numbers of (1) the genomic sequences used in the PG set, (2) the sequences used in building GS profiles, and (3) the NCBI-nr sequences used in building the IDO phylogeny. The other GenBank sequences discussed in this paper are <named-content content-type="genus-species">Bacillus sp</named-content>
. NRRL B-14911 (89089741), <named-content content-type="genus-species">Janibacter</named-content>
 sp. HTCC2649 (84385106), <named-content content-type="genus-species">Erythrobacter litoralis</named-content>
 (84785911), and <named-content content-type="genus-species">Nitrosococcus oceani</named-content>
 (76881875). The Pfam (<ext-link ext-link-type="uri" xlink:href="http://pfam.cgb.ki.se">http://pfam.cgb.ki.se</ext-link>
) structures discussed in this paper are envelope glycoprotein GP120 (PF00516), reverse transcriptase (PF00078), retroviral aspartyl protease (PF00077), bacteriophage T4-like capsid assembly protein (Gp20) (PF07230), major capsid protein Gp23 (PF07068), phage tail sheath protein (PF04984), IDO (PF01231), poxvirus A22 protein family (PF04848), and PP2C (PF00481). The glutamine synthetase TIGRFAM (<ext-link ext-link-type="uri" xlink:href="http://www.tigr.org/TIGRFAMs">http://www.tigr.org/TIGRFAMs</ext-link>
) used in the paper is GlnA: glutamine synthetase, type I (TIGR00653). The PDB (<ext-link ext-link-type="uri" xlink:href="http://www.rcsb.org/pdb">http://www.rcsb.org/pdb</ext-link>
) identifiers and the names of the eight PDB ORFans with GOS matches are: restriction endonuclease MunI (1D02), restriction endonuclease BglI (1DMU), restriction endonuclease BstYI (1SDO), restriction endonuclease HincII (1TX3), alpha-glucosyltransferase (1Y8Z), hypothetical protein PA1492 (1T1J), putative protein (1T6T), and hypothetical protein AF1548 (1Y88).</p>
</sec>
</sec>
</body>
<back><ack><p>We are indebted to a large group of individuals and groups for facilitating our sampling and analysis. We thank the governments of Canada, Mexico, Honduras, Costa Rica, Panama, and Ecuador and French Polynesia/France for facilitating sampling activities. All sequencing data collected from waters of the above-named countries remain part of the genetic patrimony of the country from which they were obtained. We also acknowledge TimeLogic (Active Motif, Inc.) and in particular Chris Hoover and Joe Salvatore for helping make the DeCypher system available to us; the Department of Energy for use of their NERSC Seaborg compute cluster; Marty Stout, Randy Doering, Tyler Osgood, Scott Collins, and Marshall Peterson (J. Craig Venter Institute) for help with the compute resources; Peter Davies and Saul Kravitz (J. Craig Venter Institute) for help with data accessibility issues; Kelvin Li and Nelson Axelrod (J. Craig Venter Institute) for discussions on data formats; K. Eric Wommack (University of Delaware, Newark) and the captain and crew of the R/V <italic>Cape Henlopen</italic>
 for their assistance in field collection of Chesapeake Bay virioplankton samples; John Glass (J. Craig Venter Institute) for assistance with the collection and processing of the virioplankton samples; Beth Hoyle and Laura Sheahan (J. Craig Venter Institute) for help with paper editing; and Matthew LaPointe and Jasmine Pollard (J. Craig Venter Institute) for help with figure formatting. STM, MPJ, CvB, DAS, and SEB acknowledge Kasper Hansen for statistical advice. We also acknowledge the reviewers for their valuable comments.</p>
</ack>
<glossary><title>Abbreviations</title>
<def-list><def-item><term>aa</term>
<def><p>amino acid</p>
</def>
</def-item>
<def-item><term>ENS</term>
<def><p>Ensembl</p>
</def>
</def-item>
<def-item><term>EST</term>
<def><p>expressed sequence tag</p>
</def>
</def-item>
<def-item><term>GO</term>
<def><p>Gene Ontology</p>
</def>
</def-item>
<def-item><term>GOS</term>
<def><p>Global Ocean Sampling</p>
</def>
</def-item>
<def-item><term>GS</term>
<def><p>glutamine synthetase</p>
</def>
</def-item>
<def-item><term>HMM</term>
<def><p>hidden Markov model</p>
</def>
</def-item>
<def-item><term>IDO</term>
<def><p>indoleamine 2,3-dioxygenase</p>
</def>
</def-item>
<def-item><term>NCBI</term>
<def><p>National Center for Biotechnology Information</p>
</def>
</def-item>
<def-item><term>ORF</term>
<def><p>open reading frame</p>
</def>
</def-item>
<def-item><term>PDB</term>
<def><p>Protein Data Bank</p>
</def>
</def-item>
<def-item><term>PG</term>
<def><p>prokaryotic genomes</p>
</def>
</def-item>
<def-item><term>PP2C</term>
<def><p>protein phosphatase 2C</p>
</def>
</def-item>
<def-item><term>PSI</term>
<def><p>Protein Structure Initiative</p>
</def>
</def-item>
<def-item><term>RLP</term>
<def><p>RuBisCO-like protein</p>
</def>
</def-item>
<def-item><term>TGI</term>
<def><p>TIGR gene indices</p>
</def>
</def-item>
<def-item><term>TC</term>
<def><p>trusted cutoff</p>
</def>
</def-item>
<def-item><term>UVDE</term>
<def><p>UV dimer endonuclease</p>
</def>
</def-item>
</def-list>
</glossary>
<ref-list><title>References</title>
<ref id="pbio-0050016-b001"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
<name><surname>Galperin</surname>
<given-names>MY</given-names>
</name>
<name><surname>Natale</surname>
<given-names>DA</given-names>
</name>
<name><surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>The COG database: A tool for genome-scale analysis of protein functions and evolution</article-title>
<source>Nucleic Acids Res</source>
<volume>28</volume>
<fpage>33</fpage>
<lpage>36</lpage>
<pub-id pub-id-type="pmid">10592175</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b002"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Murzin</surname>
<given-names>AG</given-names>
</name>
<name><surname>Brenner</surname>
<given-names>SE</given-names>
</name>
<name><surname>Hubbard</surname>
<given-names>T</given-names>
</name>
<name><surname>Chothia</surname>
<given-names>C</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>SCOP: A structural classification of proteins database for the investigation of sequences and structures</article-title>
<source>J Mol Biol</source>
<volume>247</volume>
<fpage>536</fpage>
<lpage>540</lpage>
<pub-id pub-id-type="pmid">7723011</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b003"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Orengo</surname>
<given-names>CA</given-names>
</name>
<name><surname>Michie</surname>
<given-names>AD</given-names>
</name>
<name><surname>Jones</surname>
<given-names>S</given-names>
</name>
<name><surname>Jones</surname>
<given-names>DT</given-names>
</name>
<name><surname>Swindells</surname>
<given-names>MB</given-names>
</name>
<etal></etal>
</person-group>
<year>1997</year>
<article-title>CATH—A hierarchic classification of protein domain structures</article-title>
<source>Structure</source>
<volume>5</volume>
<fpage>1093</fpage>
<lpage>1108</lpage>
<pub-id pub-id-type="pmid">9309224</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b004"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thornton</surname>
<given-names>JM</given-names>
</name>
<name><surname>Orengo</surname>
<given-names>CA</given-names>
</name>
<name><surname>Todd</surname>
<given-names>AE</given-names>
</name>
<name><surname>Pearl</surname>
<given-names>FM</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Protein folds, functions and evolution</article-title>
<source>J Mol Biol</source>
<volume>293</volume>
<fpage>333</fpage>
<lpage>342</lpage>
<pub-id pub-id-type="pmid">10529349</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b005"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Todd</surname>
<given-names>AE</given-names>
</name>
<name><surname>Orengo</surname>
<given-names>CA</given-names>
</name>
<name><surname>Thornton</surname>
<given-names>JM</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Evolution of function in protein superfamilies, from a structural perspective</article-title>
<source>J Mol Biol</source>
<volume>307</volume>
<fpage>1113</fpage>
<lpage>1143</lpage>
<pub-id pub-id-type="pmid">11286560</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b006"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coulson</surname>
<given-names>AF</given-names>
</name>
<name><surname>Moult</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>A unifold, mesofold, and superfold model of protein fold use</article-title>
<source>Proteins</source>
<volume>46</volume>
<fpage>61</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="pmid">11746703</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b007"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rost</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Did evolution leap to create the protein universe?</article-title>
<source>Curr Opin Struct Biol</source>
<volume>12</volume>
<fpage>409</fpage>
<lpage>416</lpage>
<pub-id pub-id-type="pmid">12127462</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b008"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kinch</surname>
<given-names>LN</given-names>
</name>
<name><surname>Grishin</surname>
<given-names>NV</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Evolution of protein structures and functions</article-title>
<source>Curr Opin Struct Biol</source>
<volume>12</volume>
<fpage>400</fpage>
<lpage>408</lpage>
<pub-id pub-id-type="pmid">12127461</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b009"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Galperin</surname>
<given-names>MY</given-names>
</name>
<name><surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Who's your neighbor? New computational approaches for functional genomics</article-title>
<source>Nat Biotechnol</source>
<volume>18</volume>
<fpage>609</fpage>
<lpage>613</lpage>
<pub-id pub-id-type="pmid">10835597</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b010"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Venter</surname>
<given-names>JC</given-names>
</name>
<name><surname>Remington</surname>
<given-names>K</given-names>
</name>
<name><surname>Heidelberg</surname>
<given-names>JF</given-names>
</name>
<name><surname>Halpern</surname>
<given-names>AL</given-names>
</name>
<name><surname>Rusch</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Environmental genome shotgun sequencing of the Sargasso Sea</article-title>
<source>Science</source>
<volume>304</volume>
<fpage>66</fpage>
<lpage>74</lpage>
<pub-id pub-id-type="pmid">15001713</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b011"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tringe</surname>
<given-names>SG</given-names>
</name>
<name><surname>Rubin</surname>
<given-names>EM</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Metagenomics: DNA sequencing of environmental samples</article-title>
<source>Nat Rev Genet</source>
<volume>6</volume>
<fpage>805</fpage>
<lpage>814</lpage>
<pub-id pub-id-type="pmid">16304596</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b012"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tringe</surname>
<given-names>SG</given-names>
</name>
<name><surname>von Mering</surname>
<given-names>C</given-names>
</name>
<name><surname>Kobayashi</surname>
<given-names>A</given-names>
</name>
<name><surname>Salamov</surname>
<given-names>AA</given-names>
</name>
<name><surname>Chen</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Comparative metagenomics of microbial communities</article-title>
<source>Science</source>
<volume>308</volume>
<fpage>554</fpage>
<lpage>557</lpage>
<pub-id pub-id-type="pmid">15845853</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b013"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<name><surname>Putnam</surname>
<given-names>N</given-names>
</name>
<name><surname>Preston</surname>
<given-names>CM</given-names>
</name>
<name><surname>Detter</surname>
<given-names>JC</given-names>
</name>
<name><surname>Rokhsar</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Reverse methanogenesis: Testing the hypothesis with environmental genomics</article-title>
<source>Science</source>
<volume>305</volume>
<fpage>1457</fpage>
<lpage>1462</lpage>
<pub-id pub-id-type="pmid">15353801</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b014"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tyson</surname>
<given-names>GW</given-names>
</name>
<name><surname>Chapman</surname>
<given-names>J</given-names>
</name>
<name><surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name><surname>Allen</surname>
<given-names>EE</given-names>
</name>
<name><surname>Ram</surname>
<given-names>RJ</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Community structure and metabolism through reconstruction of microbial genomes from the environment</article-title>
<source>Nature</source>
<volume>428</volume>
<fpage>37</fpage>
<lpage>43</lpage>
<pub-id pub-id-type="pmid">14961025</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b015"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
<name><surname>Coin</surname>
<given-names>L</given-names>
</name>
<name><surname>Durbin</surname>
<given-names>R</given-names>
</name>
<name><surname>Finn</surname>
<given-names>RD</given-names>
</name>
<name><surname>Hollich</surname>
<given-names>V</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>The Pfam protein families database</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>D138</fpage>
<lpage>D141</lpage>
<pub-id pub-id-type="pmid">14681378</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b016"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Corpet</surname>
<given-names>F</given-names>
</name>
<name><surname>Gouzy</surname>
<given-names>J</given-names>
</name>
<name><surname>Kahn</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>The ProDom database of protein domain families</article-title>
<source>Nucleic Acids Res</source>
<volume>26</volume>
<fpage>323</fpage>
<lpage>326</lpage>
<pub-id pub-id-type="pmid">9399865</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b017"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sasson</surname>
<given-names>O</given-names>
</name>
<name><surname>Vaaknin</surname>
<given-names>A</given-names>
</name>
<name><surname>Fleischer</surname>
<given-names>H</given-names>
</name>
<name><surname>Portugaly</surname>
<given-names>E</given-names>
</name>
<name><surname>Bilu</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>ProtoNet: Hierarchical classification of the protein space</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>348</fpage>
<lpage>352</lpage>
<pub-id pub-id-type="pmid">12520020</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b018"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pipenbacher</surname>
<given-names>P</given-names>
</name>
<name><surname>Schliep</surname>
<given-names>A</given-names>
</name>
<name><surname>Schneckener</surname>
<given-names>S</given-names>
</name>
<name><surname>Schonhuth</surname>
<given-names>A</given-names>
</name>
<name><surname>Schomburg</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>ProClust: Improved clustering of protein sequences with an extended graph-based approach</article-title>
<source>Bioinformatics</source>
<volume>18</volume>
<issue>Suppl 2</issue>
<fpage>S182</fpage>
<lpage>S191</lpage>
<pub-id pub-id-type="pmid">12386002</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b019"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<name><surname>Bairoch</surname>
<given-names>A</given-names>
</name>
<name><surname>Wu</surname>
<given-names>CH</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Protein sequence databases</article-title>
<source>Curr Opin Chem Biol</source>
<volume>8</volume>
<fpage>76</fpage>
<lpage>80</lpage>
<pub-id pub-id-type="pmid">15036160</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b020"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gasteiger</surname>
<given-names>E</given-names>
</name>
<name><surname>Jung</surname>
<given-names>E</given-names>
</name>
<name><surname>Bairoch</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>SWISS-PROT: Connecting biomolecular knowledge via a protein database</article-title>
<source>Curr Issues Mol Biol</source>
<volume>3</volume>
<fpage>47</fpage>
<lpage>55</lpage>
<pub-id pub-id-type="pmid">11488411</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b021"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
<name><surname>Eddy</surname>
<given-names>SR</given-names>
</name>
<name><surname>Birney</surname>
<given-names>E</given-names>
</name>
<name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
<name><surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Pfam: Multiple sequence alignments and HMM-profiles of protein domains</article-title>
<source>Nucleic Acids Res</source>
<volume>26</volume>
<fpage>320</fpage>
<lpage>322</lpage>
<pub-id pub-id-type="pmid">9399864</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b022"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haft</surname>
<given-names>DH</given-names>
</name>
<name><surname>Selengut</surname>
<given-names>JD</given-names>
</name>
<name><surname>White</surname>
<given-names>O</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>The TIGRFAMs database of protein families</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>371</fpage>
<lpage>373</lpage>
<pub-id pub-id-type="pmid">12520025</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b023"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haft</surname>
<given-names>DH</given-names>
</name>
<name><surname>Loftus</surname>
<given-names>BJ</given-names>
</name>
<name><surname>Richardson</surname>
<given-names>DL</given-names>
</name>
<name><surname>Yang</surname>
<given-names>F</given-names>
</name>
<name><surname>Eisen</surname>
<given-names>JA</given-names>
</name>
<etal></etal>
</person-group>
<year>2001</year>
<article-title>TIGRFAMs: A protein family resource for the functional identification of proteins</article-title>
<source>Nucleic Acids Res</source>
<volume>29</volume>
<fpage>41</fpage>
<lpage>43</lpage>
<pub-id pub-id-type="pmid">11125044</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b024"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<name><surname>Bairoch</surname>
<given-names>A</given-names>
</name>
<name><surname>Wu</surname>
<given-names>CH</given-names>
</name>
<name><surname>Barker</surname>
<given-names>WC</given-names>
</name>
<name><surname>Boeckmann</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>UniProt: The Universal Protein knowledgebase</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>D115</fpage>
<lpage>D119</lpage>
<pub-id pub-id-type="pmid">14681372</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b025"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mulder</surname>
<given-names>NJ</given-names>
</name>
<name><surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<name><surname>Attwood</surname>
<given-names>TK</given-names>
</name>
<name><surname>Bairoch</surname>
<given-names>A</given-names>
</name>
<name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>InterPro, progress and status in 2005</article-title>
<source>Nucleic Acids Res</source>
<volume>33</volume>
<fpage>D201</fpage>
<lpage>D205</lpage>
<pub-id pub-id-type="pmid">15608177</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b026"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heger</surname>
<given-names>A</given-names>
</name>
<name><surname>Holm</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Exhaustive enumeration of protein domain families</article-title>
<source>J Mol Biol</source>
<volume>328</volume>
<fpage>749</fpage>
<lpage>767</lpage>
<pub-id pub-id-type="pmid">12706730</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b027"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname>
<given-names>X</given-names>
</name>
<name><surname>Fan</surname>
<given-names>K</given-names>
</name>
<name><surname>Wang</surname>
<given-names>W</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>The number of protein folds and their distribution over families in nature</article-title>
<source>Proteins</source>
<volume>54</volume>
<fpage>491</fpage>
<lpage>499</lpage>
<pub-id pub-id-type="pmid">14747997</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b028"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kunin</surname>
<given-names>V</given-names>
</name>
<name><surname>Cases</surname>
<given-names>I</given-names>
</name>
<name><surname>Enright</surname>
<given-names>AJ</given-names>
</name>
<name><surname>de Lorenzo</surname>
<given-names>V</given-names>
</name>
<name><surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Myriads of protein families, and still counting</article-title>
<source>Genome Biol</source>
<volume>4</volume>
<fpage>401</fpage>
<pub-id pub-id-type="pmid">12620116</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b029"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bru</surname>
<given-names>C</given-names>
</name>
<name><surname>Courcelle</surname>
<given-names>E</given-names>
</name>
<name><surname>Carrere</surname>
<given-names>S</given-names>
</name>
<name><surname>Beausse</surname>
<given-names>Y</given-names>
</name>
<name><surname>Dalmar</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>The ProDom database of protein domain families: More emphasis on 3D</article-title>
<source>Nucleic Acids Res</source>
<volume>33</volume>
<fpage>D212</fpage>
<lpage>D215</lpage>
<pub-id pub-id-type="pmid">15608179</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b030"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rusch</surname>
<given-names>DB</given-names>
</name>
<name><surname>Halpern</surname>
<given-names>AL</given-names>
</name>
<name><surname>Sutton</surname>
<given-names>G</given-names>
</name>
<name><surname>Heidelberg</surname>
<given-names>KB</given-names>
</name>
<name><surname>Williamson</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<year>2007</year>
<article-title>The <italic>Sorcerer II</italic>
 Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific</article-title>
<source>PLoS Biol</source>
<volume>5</volume>
<fpage>e77</fpage>
<comment>doi:<ext-link ext-link-type="doi" xlink:href="10.1371/journal.pbio.0050077">10.1371/journal.pbio.0050077</ext-link>
</comment>
<pub-id pub-id-type="pmid">17355176</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b031"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wheeler</surname>
<given-names>DL</given-names>
</name>
<name><surname>Barrett</surname>
<given-names>T</given-names>
</name>
<name><surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name><surname>Bryant</surname>
<given-names>SH</given-names>
</name>
<name><surname>Canese</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<year>2006</year>
<article-title>Database resources of the National Center for Biotechnology Information</article-title>
<source>Nucleic Acids Res</source>
<volume>34</volume>
<fpage>D173</fpage>
<lpage>D180</lpage>
<pub-id pub-id-type="pmid">16381840</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b032"><citation citation-type="book"><collab>National Center for Biotechnology Information</collab>
<year>2005</year>
<source>Blast db [database]</source>
<publisher-loc>Washington (D.C.)</publisher-loc>
<publisher-name>National Center for Biotechnology Information</publisher-name>
<comment>Available: <ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/blast/db">ftp://ftp.ncbi.nih.gov/blast/db</ext-link>
. Accessed 10 February 2005.</comment>
</citation>
</ref>
<ref id="pbio-0050016-b033"><citation citation-type="book"><collab>National Center for Biotechnology Information</collab>
<year>2005</year>
<source>Microbial Genome Projects db [database]</source>
<publisher-loc>Washington (D.C.)</publisher-loc>
<publisher-name>National Center for Biotechnology Information</publisher-name>
<comment>Available: <ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/genomes/Bacteria">ftp://ftp.ncbi.nih.gov/genomes/Bacteria</ext-link>
. Accessed 10 February 2005.</comment>
</citation>
</ref>
<ref id="pbio-0050016-b034"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Quackenbush</surname>
<given-names>J</given-names>
</name>
<name><surname>Liang</surname>
<given-names>F</given-names>
</name>
<name><surname>Holt</surname>
<given-names>I</given-names>
</name>
<name><surname>Pertea</surname>
<given-names>G</given-names>
</name>
<name><surname>Upton</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>The TIGR gene indices: Reconstruction and representation of expressed gene sequences</article-title>
<source>Nucleic Acids Res</source>
<volume>28</volume>
<fpage>141</fpage>
<lpage>145</lpage>
<pub-id pub-id-type="pmid">10592205</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b035"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Birney</surname>
<given-names>E</given-names>
</name>
<name><surname>Andrews</surname>
<given-names>D</given-names>
</name>
<name><surname>Bevan</surname>
<given-names>P</given-names>
</name>
<name><surname>Caccamo</surname>
<given-names>M</given-names>
</name>
<name><surname>Cameron</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Ensembl 2004</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>D468</fpage>
<lpage>D470</lpage>
<pub-id pub-id-type="pmid">14681459</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b036"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Birney</surname>
<given-names>E</given-names>
</name>
<name><surname>Andrews</surname>
<given-names>TD</given-names>
</name>
<name><surname>Bevan</surname>
<given-names>P</given-names>
</name>
<name><surname>Caccamo</surname>
<given-names>M</given-names>
</name>
<name><surname>Chen</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>An overview of Ensembl</article-title>
<source>Genome Res</source>
<volume>14</volume>
<fpage>925</fpage>
<lpage>928</lpage>
<pub-id pub-id-type="pmid">15078858</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b037"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name><surname>Sutton</surname>
<given-names>GG</given-names>
</name>
<name><surname>Delcher</surname>
<given-names>AL</given-names>
</name>
<name><surname>Dew</surname>
<given-names>IM</given-names>
</name>
<name><surname>Fasulo</surname>
<given-names>DP</given-names>
</name>
<etal></etal>
</person-group>
<year>2000</year>
<article-title>A whole-genome assembly of <italic>Drosophila</italic>
</article-title>
<source>Science</source>
<volume>287</volume>
<fpage>2196</fpage>
<lpage>2204</lpage>
<pub-id pub-id-type="pmid">10731133</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b038"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Gish</surname>
<given-names>W</given-names>
</name>
<name><surname>Miller</surname>
<given-names>W</given-names>
</name>
<name><surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<year>1990</year>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b039"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rychlewski</surname>
<given-names>L</given-names>
</name>
<name><surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name><surname>Li</surname>
<given-names>W</given-names>
</name>
<name><surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Comparison of sequence profiles. Strategies for structural predictions using sequence information</article-title>
<source>Protein Sci</source>
<volume>9</volume>
<fpage>232</fpage>
<lpage>241</lpage>
<pub-id pub-id-type="pmid">10716175</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b040"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name><surname>Schaffer</surname>
<given-names>AA</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
</person-group>
<year>1997</year>
<article-title>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</article-title>
<source>Nucleic Acids Res</source>
<volume>25</volume>
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="pmid">9254694</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b041"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Durbin</surname>
<given-names>R</given-names>
</name>
<name><surname>Eddy</surname>
<given-names>SR</given-names>
</name>
<name><surname>Krogh</surname>
<given-names>A</given-names>
</name>
<name><surname>Mitchison</surname>
<given-names>G</given-names>
</name>
</person-group>
<year>1998</year>
<source>Biological sequence analysis: Probabilistic models of proteins and nucleic acids</source>
<publisher-loc>New York</publisher-loc>
<publisher-name>Cambridge University Press</publisher-name>
<page-count count="356"></page-count>
</citation>
</ref>
<ref id="pbio-0050016-b042"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barabasi</surname>
<given-names>AL</given-names>
</name>
<name><surname>Albert</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Emergence of scaling in random networks</article-title>
<source>Science</source>
<volume>286</volume>
<fpage>509</fpage>
<lpage>512</lpage>
<pub-id pub-id-type="pmid">10521342</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b043"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barabasi</surname>
<given-names>AL</given-names>
</name>
<name><surname>Oltvai</surname>
<given-names>ZN</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Network biology: Understanding the cell's functional organization</article-title>
<source>Nat Rev Genet</source>
<volume>5</volume>
<fpage>101</fpage>
<lpage>113</lpage>
<pub-id pub-id-type="pmid">14735121</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b044"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
<name><surname>Coleman</surname>
<given-names>ML</given-names>
</name>
<name><surname>Weigele</surname>
<given-names>P</given-names>
</name>
<name><surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<name><surname>Chisholm</surname>
<given-names>SW</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Three <named-content content-type="genus-species">Prochlorococcus cyanophage</named-content>
 genomes: Signature features and ecological interpretations</article-title>
<source>PLoS Biol</source>
<volume>3</volume>
<issue>5</issue>
<elocation-id>e144</elocation-id>
</citation>
</ref>
<ref id="pbio-0050016-b045"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Giovannoni</surname>
<given-names>SJ</given-names>
</name>
<name><surname>Tripp</surname>
<given-names>HJ</given-names>
</name>
<name><surname>Givan</surname>
<given-names>S</given-names>
</name>
<name><surname>Podar</surname>
<given-names>M</given-names>
</name>
<name><surname>Vergin</surname>
<given-names>KL</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Genome streamlining in a cosmopolitan oceanic bacterium</article-title>
<source>Science</source>
<volume>309</volume>
<fpage>1242</fpage>
<lpage>1245</lpage>
<pub-id pub-id-type="pmid">16109880</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b046"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Strong</surname>
<given-names>M</given-names>
</name>
<name><surname>Mallick</surname>
<given-names>P</given-names>
</name>
<name><surname>Pellegrini</surname>
<given-names>M</given-names>
</name>
<name><surname>Thompson</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Eisenberg</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Inference of protein function and protein linkages in <italic>Mycobacterium tuberculosis</italic> based on prokaryotic genome organization: A combined computational approach</article-title>
<source>Genome Biol</source>
<volume>4</volume>
<fpage>R59</fpage>
<pub-id pub-id-type="pmid">12952538</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b047"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bowers</surname>
<given-names>PM</given-names>
</name>
<name><surname>Pellegrini</surname>
<given-names>M</given-names>
</name>
<name><surname>Thompson</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Fierro</surname>
<given-names>J</given-names>
</name>
<name><surname>Yeates</surname>
<given-names>TO</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Prolinks: A database of protein functional linkages derived from coevolution</article-title>
<source>Genome Biol</source>
<volume>5</volume>
<fpage>R35</fpage>
<pub-id pub-id-type="pmid">15128449</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b048"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>von Mering</surname>
<given-names>C</given-names>
</name>
<name><surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name><surname>Snel</surname>
<given-names>B</given-names>
</name>
<name><surname>Hooper</surname>
<given-names>SD</given-names>
</name>
<name><surname>Krupp</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>STRING: Known and predicted protein-protein associations, integrated and transferred across organisms</article-title>
<source>Nucleic Acids Res</source>
<volume>33</volume>
<fpage>D433</fpage>
<lpage>D437</lpage>
<pub-id pub-id-type="pmid">15608232</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b049"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name><surname>Ball</surname>
<given-names>CA</given-names>
</name>
<name><surname>Blake</surname>
<given-names>JA</given-names>
</name>
<name><surname>Botstein</surname>
<given-names>D</given-names>
</name>
<name><surname>Butler</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<year>2000</year>
<article-title>Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium</article-title>
<source>Nat Genet</source>
<volume>25</volume>
<fpage>25</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="pmid">10802651</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b050"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lindell</surname>
<given-names>D</given-names>
</name>
<name><surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
<name><surname>Johnson</surname>
<given-names>ZI</given-names>
</name>
<name><surname>Tolonen</surname>
<given-names>AC</given-names>
</name>
<name><surname>Rohwer</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Transfer of photosynthesis genes to and from <italic>Prochlorococcus</italic>
 viruses</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>101</volume>
<fpage>11013</fpage>
<lpage>11018</lpage>
<pub-id pub-id-type="pmid">15256601</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b051"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>DeLong</surname>
<given-names>EF</given-names>
</name>
<name><surname>Preston</surname>
<given-names>CM</given-names>
</name>
<name><surname>Mincer</surname>
<given-names>T</given-names>
</name>
<name><surname>Rich</surname>
<given-names>V</given-names>
</name>
<name><surname>Hallam</surname>
<given-names>SJ</given-names>
</name>
<etal></etal>
</person-group>
<year>2006</year>
<article-title>Community genomics among stratified microbial assemblages in the ocean's interior</article-title>
<source>Science</source>
<volume>311</volume>
<fpage>496</fpage>
<lpage>503</lpage>
<pub-id pub-id-type="pmid">16439655</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b052"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paul</surname>
<given-names>JH</given-names>
</name>
<name><surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Marine phage genomics: What have we learned?</article-title>
<source>Curr Opin Biotechnol</source>
<volume>16</volume>
<fpage>299</fpage>
<lpage>307</lpage>
<pub-id pub-id-type="pmid">15961031</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b053"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name><surname>Rohwer</surname>
<given-names>F</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Viral metagenomics</article-title>
<source>Nat Rev Microbiol</source>
<volume>3</volume>
<fpage>504</fpage>
<lpage>510</lpage>
<pub-id pub-id-type="pmid">15886693</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b054"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Daubin</surname>
<given-names>V</given-names>
</name>
<name><surname>Ochman</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Bacterial genomes as new gene homes: The genealogy of ORFans in <named-content content-type="genus-species">E. coli</named-content>
</article-title>
<source>Genome Res</source>
<volume>14</volume>
<fpage>1036</fpage>
<lpage>1042</lpage>
<pub-id pub-id-type="pmid">15173110</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b055"><citation citation-type="other"><person-group person-group-type="author"><name><surname>Hsiao</surname>
<given-names>WW</given-names>
</name>
<name><surname>Ung</surname>
<given-names>K</given-names>
</name>
<name><surname>Aeschliman</surname>
<given-names>D</given-names>
</name>
<name><surname>Bryan</surname>
<given-names>J</given-names>
</name>
<name><surname>Finlay</surname>
<given-names>BB</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Evidence of a large novel gene pool associated with prokaryotic genomic islands</article-title>
<source>PLoS Genet</source>
<volume>1</volume>
<issue>5</issue>
<elocation-id>e62</elocation-id>
</citation>
</ref>
<ref id="pbio-0050016-b056"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coleman</surname>
<given-names>ML</given-names>
</name>
<name><surname>Sullivan</surname>
<given-names>MB</given-names>
</name>
<name><surname>Martiny</surname>
<given-names>AC</given-names>
</name>
<name><surname>Steglich</surname>
<given-names>C</given-names>
</name>
<name><surname>Barry</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<year>2006</year>
<article-title>Genomic islands and the ecology and evolution of <italic>Prochlorococcus</italic>
</article-title>
<source>Science</source>
<volume>311</volume>
<fpage>1768</fpage>
<lpage>1770</lpage>
<pub-id pub-id-type="pmid">16556843</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b057"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name><surname>Salamon</surname>
<given-names>P</given-names>
</name>
<name><surname>Andresen</surname>
<given-names>B</given-names>
</name>
<name><surname>Mahaffy</surname>
<given-names>JM</given-names>
</name>
<name><surname>Segall</surname>
<given-names>AM</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>Genomic analysis of uncultured marine viral communities</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>99</volume>
<fpage>14250</fpage>
<lpage>14255</lpage>
<pub-id pub-id-type="pmid">12384570</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b058"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pedulla</surname>
<given-names>ML</given-names>
</name>
<name><surname>Ford</surname>
<given-names>ME</given-names>
</name>
<name><surname>Houtz</surname>
<given-names>JM</given-names>
</name>
<name><surname>Karthikeyan</surname>
<given-names>T</given-names>
</name>
<name><surname>Wadsworth</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Origins of highly mosaic mycobacteriophage genomes</article-title>
<source>Cell</source>
<volume>113</volume>
<fpage>171</fpage>
<lpage>182</lpage>
<pub-id pub-id-type="pmid">12705866</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b059"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilson</surname>
<given-names>GA</given-names>
</name>
<name><surname>Bertrand</surname>
<given-names>N</given-names>
</name>
<name><surname>Patel</surname>
<given-names>Y</given-names>
</name>
<name><surname>Hughes</surname>
<given-names>JB</given-names>
</name>
<name><surname>Feil</surname>
<given-names>EJ</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Orphans as taxonomically restricted and ecologically important genes</article-title>
<source>Microbiology</source>
<volume>151</volume>
<fpage>2499</fpage>
<lpage>2501</lpage>
<pub-id pub-id-type="pmid">16079329</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b060"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Takami</surname>
<given-names>H</given-names>
</name>
<name><surname>Takaki</surname>
<given-names>Y</given-names>
</name>
<name><surname>Uchiyama</surname>
<given-names>I</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Genome sequence of <named-content content-type="genus-species">Oceanobacillus iheyensis</named-content>
 isolated from the Iheya Ridge and its unexpected adaptive capabilities to extreme environments</article-title>
<source>Nucleic Acids Res</source>
<volume>30</volume>
<fpage>3927</fpage>
<lpage>3935</lpage>
<pub-id pub-id-type="pmid">12235376</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b061"><citation citation-type="book"><collab>Wellcome Trust Sanger Institute</collab>
<year>2005</year>
<source>Pfam db [database]. Release 17</source>
<publisher-loc>Cambridge (U.K.)</publisher-loc>
<publisher-name>Wellcome Trust Sanger Institute</publisher-name>
<comment>Available: <ext-link ext-link-type="uri" xlink:href="http://www.sanger.ac.uk/Software/Pfam">http://www.sanger.ac.uk/Software/Pfam</ext-link>
.</comment>
</citation>
</ref>
<ref id="pbio-0050016-b062"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mellor</surname>
<given-names>AL</given-names>
</name>
<name><surname>Munn</surname>
<given-names>DH</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>IDO expression by dendritic cells: Tolerance and tryptophan catabolism</article-title>
<source>Nat Rev Immunol</source>
<volume>4</volume>
<fpage>762</fpage>
<lpage>774</lpage>
<pub-id pub-id-type="pmid">15459668</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b063"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suzuki</surname>
<given-names>T</given-names>
</name>
<name><surname>Yokouchi</surname>
<given-names>K</given-names>
</name>
<name><surname>Kawamichi</surname>
<given-names>H</given-names>
</name>
<name><surname>Yamamoto</surname>
<given-names>Y</given-names>
</name>
<name><surname>Uda</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Comparison of the sequences of <italic>Turbo</italic> and <italic>Sulculus</italic> indoleamine dioxygenase-like myoglobin genes</article-title>
<source>Gene</source>
<volume>308</volume>
<fpage>89</fpage>
<lpage>94</lpage>
<pub-id pub-id-type="pmid">12711393</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b064"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fallarino</surname>
<given-names>F</given-names>
</name>
<name><surname>Asselin-Paturel</surname>
<given-names>C</given-names>
</name>
<name><surname>Vacca</surname>
<given-names>C</given-names>
</name>
<name><surname>Bianchi</surname>
<given-names>R</given-names>
</name>
<name><surname>Gizzi</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Murine plasmacytoid dendritic cells initiate the immunosuppressive pathway of tryptophan catabolism in response to CD200 receptor engagement</article-title>
<source>J Immunol</source>
<volume>173</volume>
<fpage>3748</fpage>
<lpage>3754</lpage>
<pub-id pub-id-type="pmid">15356121</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b065"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hayashi</surname>
<given-names>T</given-names>
</name>
<name><surname>Beck</surname>
<given-names>L</given-names>
</name>
<name><surname>Rossetto</surname>
<given-names>C</given-names>
</name>
<name><surname>Gong</surname>
<given-names>X</given-names>
</name>
<name><surname>Takikawa</surname>
<given-names>O</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Inhibition of experimental asthma by indoleamine 2,3-dioxygenase</article-title>
<source>J Clin Invest</source>
<volume>114</volume>
<fpage>270</fpage>
<lpage>279</lpage>
<pub-id pub-id-type="pmid">15254594</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b066"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Muller</surname>
<given-names>AJ</given-names>
</name>
<name><surname>DuHadaway</surname>
<given-names>JB</given-names>
</name>
<name><surname>Donover</surname>
<given-names>PS</given-names>
</name>
<name><surname>Sutanto-Ward</surname>
<given-names>E</given-names>
</name>
<name><surname>Prendergast</surname>
<given-names>GC</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Inhibition of indoleamine 2,3-dioxygenase, an immunoregulatory target of the cancer suppression gene <italic>Bin1</italic>,
 potentiates cancer chemotherapy</article-title>
<source>Nat Med</source>
<volume>11</volume>
<fpage>312</fpage>
<lpage>319</lpage>
<pub-id pub-id-type="pmid">15711557</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b067"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Burley</surname>
<given-names>SK</given-names>
</name>
<name><surname>Bonanno</surname>
<given-names>JB</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Structural genomics</article-title>
<source>Methods Biochem Anal</source>
<volume>44</volume>
<fpage>591</fpage>
<lpage>612</lpage>
<pub-id pub-id-type="pmid">12647406</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b068"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blundell</surname>
<given-names>TL</given-names>
</name>
<name><surname>Mizuguchi</surname>
<given-names>K</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Structural genomics: An overview</article-title>
<source>Prog Biophys Mol Biol</source>
<volume>73</volume>
<fpage>289</fpage>
<lpage>295</lpage>
<pub-id pub-id-type="pmid">11063776</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b069"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brenner</surname>
<given-names>SE</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>A tour of structural genomics</article-title>
<source>Nat Rev Genet</source>
<volume>2</volume>
<fpage>801</fpage>
<lpage>809</lpage>
<pub-id pub-id-type="pmid">11584296</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b070"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Montelione</surname>
<given-names>GT</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Structural genomics: An approach to the protein folding problem</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>98</volume>
<fpage>13488</fpage>
<lpage>13489</lpage>
<pub-id pub-id-type="pmid">11717420</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b071"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chance</surname>
<given-names>MR</given-names>
</name>
<name><surname>Bresnick</surname>
<given-names>AR</given-names>
</name>
<name><surname>Burley</surname>
<given-names>SK</given-names>
</name>
<name><surname>Jiang</surname>
<given-names>JS</given-names>
</name>
<name><surname>Lima</surname>
<given-names>CD</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>Structural genomics: A pipeline for providing structures for the biologist</article-title>
<source>Protein Sci</source>
<volume>11</volume>
<fpage>723</fpage>
<lpage>738</lpage>
<pub-id pub-id-type="pmid">11910018</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b072"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chandonia</surname>
<given-names>JM</given-names>
</name>
<name><surname>Brenner</surname>
<given-names>SE</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>The impact of structural genomics: Expectations and outcomes</article-title>
<source>Science</source>
<volume>311</volume>
<fpage>347</fpage>
<lpage>351</lpage>
<pub-id pub-id-type="pmid">16424331</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b073"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chandonia</surname>
<given-names>JM</given-names>
</name>
<name><surname>Brenner</surname>
<given-names>SE</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Implications of structural genomics target selection strategies: Pfam5000, whole genome, and random approaches</article-title>
<source>Proteins</source>
<volume>58</volume>
<fpage>166</fpage>
<lpage>179</lpage>
<pub-id pub-id-type="pmid">15521074</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b074"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chandonia</surname>
<given-names>JM</given-names>
</name>
<name><surname>Brenner</surname>
<given-names>SE</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Update on the Pfam5000 strategy for selection of structural genomics targets</article-title>
<source>Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference</source>
<conf-loc>Shanghai, China</conf-loc>
<volume>27</volume>
<fpage>751</fpage>
<lpage>755</lpage>
</citation>
</ref>
<ref id="pbio-0050016-b075"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baker</surname>
<given-names>D</given-names>
</name>
<name><surname>Sali</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Protein structure prediction and structural genomics</article-title>
<source>Science</source>
<volume>294</volume>
<fpage>93</fpage>
<lpage>96</lpage>
<pub-id pub-id-type="pmid">11588250</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b076"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Service</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Structural biology. Structural genomics, round 2</article-title>
<source>Science</source>
<volume>307</volume>
<fpage>1554</fpage>
<lpage>1558</lpage>
<pub-id pub-id-type="pmid">15761136</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b077"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kannan</surname>
<given-names>N</given-names>
</name>
<name><surname>Taylor</surname>
<given-names>SS</given-names>
</name>
<name><surname>Zhai</surname>
<given-names>Y</given-names>
</name>
<name><surname>Venter</surname>
<given-names>JC</given-names>
</name>
<name><surname>Manning</surname>
<given-names>G</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Structural and functional diversity of the microbial kinome</article-title>
<source>PLoS Biol</source>
<volume>5</volume>
<fpage>e17</fpage>
<comment>doi:<ext-link ext-link-type="doi" xlink:href="10.1371/journal.pbio.0050017">10.1371/journal.pbio.0050017</ext-link>
</comment>
<pub-id pub-id-type="pmid">17355172</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b078"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Friedberg</surname>
<given-names>E</given-names>
</name>
</person-group>
<year>1985</year>
<source>DNA repair</source>
<publisher-loc>New York</publisher-loc>
<publisher-name>W. H. Freeman and Co</publisher-name>
<page-count count="614"></page-count>
</citation>
</ref>
<ref id="pbio-0050016-b079"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sancar</surname>
<given-names>GB</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Enzymatic photoreactivation: 50 years and counting</article-title>
<source>Mutat Res</source>
<volume>451</volume>
<fpage>25</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="pmid">10915863</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b080"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bowman</surname>
<given-names>KK</given-names>
</name>
<name><surname>Sidik</surname>
<given-names>K</given-names>
</name>
<name><surname>Smith</surname>
<given-names>CA</given-names>
</name>
<name><surname>Taylor</surname>
<given-names>JS</given-names>
</name>
<name><surname>Doetsch</surname>
<given-names>PW</given-names>
</name>
<etal></etal>
</person-group>
<year>1994</year>
<article-title>A new ATP-independent DNA endonuclease from <named-content content-type="genus-species">Schizosaccharomyces pombe</named-content>
 that recognizes cyclobutane pyrimidine dimers and 6–4 photoproducts</article-title>
<source>Nucleic Acids Res</source>
<volume>22</volume>
<fpage>3026</fpage>
<lpage>3032</lpage>
<pub-id pub-id-type="pmid">8065916</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b081"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Setlow</surname>
<given-names>P</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Resistance of spores of <italic>Bacillus</italic>
 species to ultraviolet light</article-title>
<source>Environ Mol Mutagen</source>
<volume>38</volume>
<fpage>97</fpage>
<lpage>104</lpage>
<pub-id pub-id-type="pmid">11746741</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b082"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morikawa</surname>
<given-names>K</given-names>
</name>
<name><surname>Ariyoshi</surname>
<given-names>M</given-names>
</name>
<name><surname>Vassylyev</surname>
<given-names>D</given-names>
</name>
<name><surname>Katayanagi</surname>
<given-names>K</given-names>
</name>
<name><surname>Nakamura</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<year>1994</year>
<article-title>Crystal structure of T4 endonuclease V. An excision repair enzyme for a pyrimidine dimer</article-title>
<source>Ann N Y Acad Sci</source>
<volume>726</volume>
<fpage>198</fpage>
<lpage>207</lpage>
<pub-id pub-id-type="pmid">8092676</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b083"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Piersen</surname>
<given-names>CE</given-names>
</name>
<name><surname>Prince</surname>
<given-names>MA</given-names>
</name>
<name><surname>Augustine</surname>
<given-names>ML</given-names>
</name>
<name><surname>Dodson</surname>
<given-names>ML</given-names>
</name>
<name><surname>Lloyd</surname>
<given-names>RS</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>Purification and cloning of <named-content content-type="genus-species">Micrococcus luteus</named-content>
 ultraviolet endonuclease, an N-glycosylase/abasic lyase that proceeds via an imino enzyme-DNA intermediate</article-title>
<source>J Biol Chem</source>
<volume>270</volume>
<fpage>23475</fpage>
<lpage>23484</lpage>
<pub-id pub-id-type="pmid">7559510</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b084"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hunter</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>Protein kinases and phosphatases: The yin and yang of protein phosphorylation and signaling</article-title>
<source>Cell</source>
<volume>80</volume>
<fpage>225</fpage>
<lpage>236</lpage>
<pub-id pub-id-type="pmid">7834742</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b085"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kennelly</surname>
<given-names>PJ</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Protein phosphatases—A phylogenetic perspective</article-title>
<source>Chem Rev</source>
<volume>101</volume>
<fpage>2291</fpage>
<lpage>2312</lpage>
<pub-id pub-id-type="pmid">11749374</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b086"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Leroy</surname>
<given-names>C</given-names>
</name>
<name><surname>Lee</surname>
<given-names>SE</given-names>
</name>
<name><surname>Vaze</surname>
<given-names>MB</given-names>
</name>
<name><surname>Ochsenbein</surname>
<given-names>F</given-names>
</name>
<name><surname>Guerois</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>PP2C phosphatases Ptc2 and Ptc3 are required for DNA checkpoint inactivation after a double-strand break</article-title>
<source>Mol Cell</source>
<volume>11</volume>
<fpage>827</fpage>
<lpage>835</lpage>
<pub-id pub-id-type="pmid">12667463</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b087"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meskiene</surname>
<given-names>I</given-names>
</name>
<name><surname>Baudouin</surname>
<given-names>E</given-names>
</name>
<name><surname>Schweighofer</surname>
<given-names>A</given-names>
</name>
<name><surname>Liwosz</surname>
<given-names>A</given-names>
</name>
<name><surname>Jonak</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Stress-induced protein phosphatase 2C is a negative regulator of a mitogen-activated protein kinase</article-title>
<source>J Biol Chem</source>
<volume>278</volume>
<fpage>18945</fpage>
<lpage>18952</lpage>
<pub-id pub-id-type="pmid">12646559</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b088"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Takekawa</surname>
<given-names>M</given-names>
</name>
<name><surname>Maeda</surname>
<given-names>T</given-names>
</name>
<name><surname>Saito</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Protein phosphatase 2Cα inhibits the human stress-responsive p38 and JNK MAPK pathways</article-title>
<source>EMBO J</source>
<volume>17</volume>
<fpage>4744</fpage>
<lpage>4752</lpage>
<pub-id pub-id-type="pmid">9707433</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b089"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Warmka</surname>
<given-names>J</given-names>
</name>
<name><surname>Hanneman</surname>
<given-names>J</given-names>
</name>
<name><surname>Lee</surname>
<given-names>J</given-names>
</name>
<name><surname>Amin</surname>
<given-names>D</given-names>
</name>
<name><surname>Ota</surname>
<given-names>I</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Ptc1, a type 2C Ser/Thr phosphatase, inactivates the HOG pathway by dephosphorylating the mitogen-activated protein kinase Hog1</article-title>
<source>Mol Cell Biol</source>
<volume>21</volume>
<fpage>51</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="pmid">11113180</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b090"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bork</surname>
<given-names>P</given-names>
</name>
<name><surname>Brown</surname>
<given-names>NP</given-names>
</name>
<name><surname>Hegyi</surname>
<given-names>H</given-names>
</name>
<name><surname>Schultz</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>1996</year>
<article-title>The protein phosphatase 2C (PP2C) superfamily: Detection of bacterial homologues</article-title>
<source>Protein Sci</source>
<volume>5</volume>
<fpage>1421</fpage>
<lpage>1425</lpage>
<pub-id pub-id-type="pmid">8819174</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b091"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Das</surname>
<given-names>AK</given-names>
</name>
<name><surname>Helps</surname>
<given-names>NR</given-names>
</name>
<name><surname>Cohen</surname>
<given-names>PT</given-names>
</name>
<name><surname>Barford</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>1996</year>
<article-title>Crystal structure of the protein serine/threonine phosphatase 2C at 2.0 Å resolution</article-title>
<source>EMBO J</source>
<volume>15</volume>
<fpage>6798</fpage>
<lpage>6809</lpage>
<pub-id pub-id-type="pmid">9003755</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b092"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jackson</surname>
<given-names>MD</given-names>
</name>
<name><surname>Fjeld</surname>
<given-names>CC</given-names>
</name>
<name><surname>Denu</surname>
<given-names>JM</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Probing the function of conserved residues in the serine/threonine phosphatase PP2Cα</article-title>
<source>Biochemistry</source>
<volume>42</volume>
<fpage>8513</fpage>
<lpage>8521</lpage>
<pub-id pub-id-type="pmid">12859198</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b093"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Novakova</surname>
<given-names>L</given-names>
</name>
<name><surname>Saskova</surname>
<given-names>L</given-names>
</name>
<name><surname>Pallova</surname>
<given-names>P</given-names>
</name>
<name><surname>Janecek</surname>
<given-names>J</given-names>
</name>
<name><surname>Novotna</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2005</year>
<article-title>Characterization of a eukaryotic type serine/threonine protein kinase and protein phosphatase of <named-content content-type="genus-species">Streptococcus pneumoniae</named-content>
 and identification of kinase substrates</article-title>
<source>FEBS J</source>
<volume>272</volume>
<fpage>1243</fpage>
<lpage>1254</lpage>
<pub-id pub-id-type="pmid">15720398</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b094"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Obuchowski</surname>
<given-names>M</given-names>
</name>
<name><surname>Madec</surname>
<given-names>E</given-names>
</name>
<name><surname>Delattre</surname>
<given-names>D</given-names>
</name>
<name><surname>Boel</surname>
<given-names>G</given-names>
</name>
<name><surname>Iwanicki</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<year>2000</year>
<article-title>Characterization of PrpC from <italic>Bacillus subtilis</italic>,
 a member of the PPM phosphatase family</article-title>
<source>J Bacteriol</source>
<volume>182</volume>
<fpage>5634</fpage>
<lpage>5638</lpage>
<pub-id pub-id-type="pmid">10986276</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b095"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boitel</surname>
<given-names>B</given-names>
</name>
<name><surname>Ortiz-Lombardia</surname>
<given-names>M</given-names>
</name>
<name><surname>Duran</surname>
<given-names>R</given-names>
</name>
<name><surname>Pompeo</surname>
<given-names>F</given-names>
</name>
<name><surname>Cole</surname>
<given-names>ST</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>PknB kinase activity is regulated by phosphorylation in two Thr residues and dephosphorylation by PstP, the cognate phospho-Ser/Thr phosphatase, in <italic>Mycobacterium tuberculosis</italic></article-title>
<source>Mol Microbiol</source>
<volume>49</volume>
<fpage>1493</fpage>
<lpage>1508</lpage>
<pub-id pub-id-type="pmid">12950916</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b096"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chopra</surname>
<given-names>P</given-names>
</name>
<name><surname>Singh</surname>
<given-names>B</given-names>
</name>
<name><surname>Singh</surname>
<given-names>R</given-names>
</name>
<name><surname>Vohra</surname>
<given-names>R</given-names>
</name>
<name><surname>Koul</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Phosphoprotein phosphatase of <italic>Mycobacterium tuberculosis</italic> dephosphorylates serine-threonine kinases PknA and PknB</article-title>
<source>Biochem Biophys Res Commun</source>
<volume>311</volume>
<fpage>112</fpage>
<lpage>120</lpage>
<pub-id pub-id-type="pmid">14575702</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b097"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yeats</surname>
<given-names>C</given-names>
</name>
<name><surname>Finn</surname>
<given-names>RD</given-names>
</name>
<name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>The PASTA domain: A beta-lactam-binding domain</article-title>
<source>Trends Biochem Sci</source>
<volume>27</volume>
<fpage>438</fpage>
<pub-id pub-id-type="pmid">12217513</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b098"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schweighofer</surname>
<given-names>A</given-names>
</name>
<name><surname>Hirt</surname>
<given-names>H</given-names>
</name>
<name><surname>Meskiene</surname>
<given-names>I</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Plant PP2C phosphatases: Emerging functions in stress signaling</article-title>
<source>Trends Plant Sci</source>
<volume>9</volume>
<fpage>236</fpage>
<lpage>243</lpage>
<pub-id pub-id-type="pmid">15130549</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b099"><citation citation-type="book"><person-group person-group-type="editor"><name><surname>Barrett</surname>
<given-names>AJ</given-names>
</name>
<name><surname>Rawlings</surname>
<given-names>ND</given-names>
</name>
<name><surname>Woessner</surname>
<given-names>JF</given-names>
</name>
</person-group>
<year>2004</year>
<source>Handbook of proteolytic enzymes</source>
<publisher-loc>Amsterdam</publisher-loc>
<publisher-name>Elsevier</publisher-name>
<page-count count="2140"></page-count>
</citation>
</ref>
<ref id="pbio-0050016-b100"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rawlings</surname>
<given-names>ND</given-names>
</name>
<name><surname>Morton</surname>
<given-names>FR</given-names>
</name>
<name><surname>Barrett</surname>
<given-names>AJ</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>MEROPS: The peptidase database</article-title>
<source>Nucleic Acids Res</source>
<volume>34</volume>
<fpage>D270</fpage>
<lpage>D272</lpage>
<pub-id pub-id-type="pmid">16381862</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b101"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kumada</surname>
<given-names>Y</given-names>
</name>
<name><surname>Benson</surname>
<given-names>DR</given-names>
</name>
<name><surname>Hillemann</surname>
<given-names>D</given-names>
</name>
<name><surname>Hosted</surname>
<given-names>TJ</given-names>
</name>
<name><surname>Rochefort</surname>
<given-names>DA</given-names>
</name>
<etal></etal>
</person-group>
<year>1993</year>
<article-title>Evolution of the glutamine synthetase gene, one of the oldest existing and functioning genes</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>90</volume>
<fpage>3009</fpage>
<lpage>3013</lpage>
<pub-id pub-id-type="pmid">8096645</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b102"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Valentine</surname>
<given-names>RC</given-names>
</name>
<name><surname>Shapiro</surname>
<given-names>BM</given-names>
</name>
<name><surname>Stadtman</surname>
<given-names>ER</given-names>
</name>
</person-group>
<year>1968</year>
<article-title>Regulation of glutamine synthetase. XII. Electron microscopy of the enzyme from <named-content content-type="genus-species">Escherichia coli</named-content>
</article-title>
<source>Biochemistry</source>
<volume>7</volume>
<fpage>2143</fpage>
<lpage>2152</lpage>
<pub-id pub-id-type="pmid">4873173</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b103"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Almassy</surname>
<given-names>RJ</given-names>
</name>
<name><surname>Janson</surname>
<given-names>CA</given-names>
</name>
<name><surname>Hamlin</surname>
<given-names>R</given-names>
</name>
<name><surname>Xuong</surname>
<given-names>NH</given-names>
</name>
<name><surname>Eisenberg</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>1986</year>
<article-title>Novel subunit-subunit interactions in the structure of glutamine synthetase</article-title>
<source>Nature</source>
<volume>323</volume>
<fpage>304</fpage>
<lpage>309</lpage>
<pub-id pub-id-type="pmid">2876389</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b104"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eisenberg</surname>
<given-names>D</given-names>
</name>
<name><surname>Gill</surname>
<given-names>HS</given-names>
</name>
<name><surname>Pfluegl</surname>
<given-names>GM</given-names>
</name>
<name><surname>Rotstein</surname>
<given-names>SH</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Structure-function relationships of glutamine synthetases</article-title>
<source>Biochim Biophys Acta</source>
<volume>1477</volume>
<fpage>122</fpage>
<lpage>145</lpage>
<pub-id pub-id-type="pmid">10708854</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b105"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eddy</surname>
<given-names>SR</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Profile hidden Markov models</article-title>
<source>Bioinformatics</source>
<volume>14</volume>
<fpage>755</fpage>
<lpage>763</lpage>
<pub-id pub-id-type="pmid">9918945</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b106"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carlson</surname>
<given-names>T</given-names>
</name>
<name><surname>Chelm</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>1986</year>
<article-title>Apparent eukaryotic origin of glutamine synthetase II from the bacterium <named-content content-type="genus-species">Bradyrhizobium japonicum</named-content>
</article-title>
<source>Nature</source>
<volume>322</volume>
<fpage>568</fpage>
<lpage>570</lpage>
</citation>
</ref>
<ref id="pbio-0050016-b107"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hosted</surname>
<given-names>TJ</given-names>
</name>
<name><surname>Rochefort</surname>
<given-names>DA</given-names>
</name>
<name><surname>Benson</surname>
<given-names>DR</given-names>
</name>
</person-group>
<year>1993</year>
<article-title>Close linkage of genes encoding glutamine synthetases I and II in <named-content content-type="genus-species">Frankia alni</named-content>
 CpI1</article-title>
<source>J Bacteriol</source>
<volume>175</volume>
<fpage>3679</fpage>
<lpage>3684</lpage>
<pub-id pub-id-type="pmid">8099074</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b108"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deuel</surname>
<given-names>TF</given-names>
</name>
<name><surname>Ginsburg</surname>
<given-names>A</given-names>
</name>
<name><surname>Yeh</surname>
<given-names>J</given-names>
</name>
<name><surname>Shelton</surname>
<given-names>E</given-names>
</name>
<name><surname>Stadtman</surname>
<given-names>ER</given-names>
</name>
</person-group>
<year>1970</year>
<article-title><named-content content-type="genus-species">Bacillus subtilis</named-content>
 glutamine synthetase. Purification and physical characterization</article-title>
<source>J Biol Chem</source>
<volume>245</volume>
<fpage>5195</fpage>
<lpage>5205</lpage>
<pub-id pub-id-type="pmid">4990297</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b109"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fisher</surname>
<given-names>SH</given-names>
</name>
<name><surname>Sonenshein</surname>
<given-names>AL</given-names>
</name>
</person-group>
<year>1984</year>
<article-title><named-content content-type="genus-species">Bacillus subtilis</named-content>
 glutamine synthetase mutants pleiotropically altered in glucose catabolite repression</article-title>
<source>J Bacteriol</source>
<volume>157</volume>
<fpage>612</fpage>
<lpage>621</lpage>
<pub-id pub-id-type="pmid">6141156</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b110"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ellis</surname>
<given-names>RJ</given-names>
</name>
</person-group>
<year>1979</year>
<article-title>The most abundant protein in the world</article-title>
<source>Trends Biochem Sci</source>
<volume>4</volume>
<fpage>241</fpage>
<lpage>244</lpage>
</citation>
</ref>
<ref id="pbio-0050016-b111"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hanson</surname>
<given-names>TE</given-names>
</name>
<name><surname>Tabita</surname>
<given-names>FR</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>A ribulose-1,5-bisphosphate carboxylase/oxygenase (RubisCO)-like protein from <named-content content-type="genus-species">Chlorobium tepidum</named-content>
 that is involved with sulfur metabolism and the response to oxidative stress</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>98</volume>
<fpage>4397</fpage>
<lpage>4402</lpage>
<pub-id pub-id-type="pmid">11287671</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b112"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eisen</surname>
<given-names>JA</given-names>
</name>
<name><surname>Nelson</surname>
<given-names>KE</given-names>
</name>
<name><surname>Paulsen</surname>
<given-names>IT</given-names>
</name>
<name><surname>Heidelberg</surname>
<given-names>JF</given-names>
</name>
<name><surname>Wu</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>The complete genome sequence of <named-content content-type="genus-species">Chlorobium tepidum</named-content>
 TLS, a photosynthetic, anaerobic, green-sulfur bacterium</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>99</volume>
<fpage>9509</fpage>
<lpage>9514</lpage>
<pub-id pub-id-type="pmid">12093901</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b113"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>H</given-names>
</name>
<name><surname>Sawaya</surname>
<given-names>MR</given-names>
</name>
<name><surname>Tabita</surname>
<given-names>FR</given-names>
</name>
<name><surname>Eisenberg</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Crystal structure of a RuBisCO-like protein from the green sulfur bacterium <named-content content-type="genus-species">Chlorobium tepidum</named-content>
</article-title>
<source>Structure (Camb)</source>
<volume>13</volume>
<fpage>779</fpage>
<lpage>789</lpage>
<pub-id pub-id-type="pmid">15893668</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b114"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ashida</surname>
<given-names>H</given-names>
</name>
<name><surname>Saito</surname>
<given-names>Y</given-names>
</name>
<name><surname>Kojima</surname>
<given-names>C</given-names>
</name>
<name><surname>Kobayashi</surname>
<given-names>K</given-names>
</name>
<name><surname>Ogasawara</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>A functional link between RuBisCO-like protein of <italic>Bacillus</italic>
 and photosynthetic RuBisCO</article-title>
<source>Science</source>
<volume>302</volume>
<fpage>286</fpage>
<lpage>290</lpage>
<pub-id pub-id-type="pmid">14551435</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b115"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fischer</surname>
<given-names>D</given-names>
</name>
<name><surname>Eisenberg</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Finding families for genomic ORFans</article-title>
<source>Bioinformatics</source>
<volume>15</volume>
<fpage>759</fpage>
<lpage>762</lpage>
<pub-id pub-id-type="pmid">10498776</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b116"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>W</given-names>
</name>
<name><surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name><surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Clustering of highly homologous sequences to reduce the size of large protein databases</article-title>
<source>Bioinformatics</source>
<volume>17</volume>
<fpage>282</fpage>
<lpage>283</lpage>
<pub-id pub-id-type="pmid">11294794</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b117"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>W</given-names>
</name>
<name><surname>Jaroszewski</surname>
<given-names>L</given-names>
</name>
<name><surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Tolerating some redundancy significantly speeds up clustering of large protein databases</article-title>
<source>Bioinformatics</source>
<volume>18</volume>
<fpage>77</fpage>
<lpage>82</lpage>
<pub-id pub-id-type="pmid">11836214</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b118"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bujnicki</surname>
<given-names>JM</given-names>
</name>
<name><surname>Rychlewski</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Identification of a PD-(D/E)XK-like domain with a novel configuration of the endonuclease active site in the methyl-directed restriction enzyme Mrr and its homologs</article-title>
<source>Gene</source>
<volume>267</volume>
<fpage>183</fpage>
<lpage>191</lpage>
<pub-id pub-id-type="pmid">11313145</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b119"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name><surname>Felts</surname>
<given-names>B</given-names>
</name>
<name><surname>Kelley</surname>
<given-names>S</given-names>
</name>
<name><surname>Mahaffy</surname>
<given-names>JM</given-names>
</name>
<name><surname>Nulton</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>Diversity and population structure of a near-shore marine-sediment viral community</article-title>
<source>Proc Biol Sci</source>
<volume>271</volume>
<fpage>565</fpage>
<lpage>574</lpage>
<pub-id pub-id-type="pmid">15156913</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b120"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breitbart</surname>
<given-names>M</given-names>
</name>
<name><surname>Hewson</surname>
<given-names>I</given-names>
</name>
<name><surname>Felts</surname>
<given-names>B</given-names>
</name>
<name><surname>Mahaffy</surname>
<given-names>JM</given-names>
</name>
<name><surname>Nulton</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>Metagenomic analyses of an uncultured viral community from human feces</article-title>
<source>J Bacteriol</source>
<volume>185</volume>
<fpage>6220</fpage>
<lpage>6223</lpage>
<pub-id pub-id-type="pmid">14526037</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b121"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cann</surname>
<given-names>AJ</given-names>
</name>
<name><surname>Fandrich</surname>
<given-names>SE</given-names>
</name>
<name><surname>Heaphy</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes</article-title>
<source>Virus Genes</source>
<volume>30</volume>
<fpage>151</fpage>
<lpage>156</lpage>
<pub-id pub-id-type="pmid">15744573</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b122"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Boeckmann</surname>
<given-names>B</given-names>
</name>
<name><surname>Bairoch</surname>
<given-names>A</given-names>
</name>
<name><surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<name><surname>Blatter</surname>
<given-names>MC</given-names>
</name>
<name><surname>Estreicher</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>365</fpage>
<lpage>370</lpage>
<pub-id pub-id-type="pmid">12520024</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b123"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Westbrook</surname>
<given-names>J</given-names>
</name>
<name><surname>Feng</surname>
<given-names>Z</given-names>
</name>
<name><surname>Chen</surname>
<given-names>L</given-names>
</name>
<name><surname>Yang</surname>
<given-names>H</given-names>
</name>
<name><surname>Berman</surname>
<given-names>HM</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>The Protein Data Bank and structural genomics</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>489</fpage>
<lpage>491</lpage>
<pub-id pub-id-type="pmid">12520059</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b124"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname>
<given-names>CH</given-names>
</name>
<name><surname>Yeh</surname>
<given-names>LS</given-names>
</name>
<name><surname>Huang</surname>
<given-names>H</given-names>
</name>
<name><surname>Arminski</surname>
<given-names>L</given-names>
</name>
<name><surname>Castro-Alvear</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>The Protein Information Resource</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>345</fpage>
<lpage>347</lpage>
<pub-id pub-id-type="pmid">12520019</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b125"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name><surname>Karsch-Mizrachi</surname>
<given-names>I</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Ostell</surname>
<given-names>J</given-names>
</name>
<name><surname>Wheeler</surname>
<given-names>DL</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>GenBank</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>23</fpage>
<lpage>27</lpage>
<pub-id pub-id-type="pmid">12519940</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b126"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stoesser</surname>
<given-names>G</given-names>
</name>
<name><surname>Baker</surname>
<given-names>W</given-names>
</name>
<name><surname>van den Broek</surname>
<given-names>A</given-names>
</name>
<name><surname>Garcia-Pastor</surname>
<given-names>M</given-names>
</name>
<name><surname>Kanz</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<year>2003</year>
<article-title>The EMBL Nucleotide Sequence Database: Major new developments</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>17</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">12519939</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b127"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miyazaki</surname>
<given-names>S</given-names>
</name>
<name><surname>Sugawara</surname>
<given-names>H</given-names>
</name>
<name><surname>Gojobori</surname>
<given-names>T</given-names>
</name>
<name><surname>Tateno</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>DNA Data Bank of Japan (DDBJ) in XML</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>13</fpage>
<lpage>16</lpage>
<pub-id pub-id-type="pmid">12519938</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b128"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Celniker</surname>
<given-names>SE</given-names>
</name>
<name><surname>Wheeler</surname>
<given-names>DA</given-names>
</name>
<name><surname>Kronmiller</surname>
<given-names>B</given-names>
</name>
<name><surname>Carlson</surname>
<given-names>JW</given-names>
</name>
<name><surname>Halpern</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>Finishing a whole-genome shotgun: Release 3 of the <named-content content-type="genus-species">Drosophila melanogaster</named-content>
 euchromatic genome sequence</article-title>
<source>Genome Biol</source>
<volume>3</volume>
<fpage>RESEARCH0079</fpage>
<pub-id pub-id-type="pmid">12537568</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b129"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Henikoff</surname>
<given-names>S</given-names>
</name>
<name><surname>Henikoff</surname>
<given-names>JG</given-names>
</name>
</person-group>
<year>1992</year>
<article-title>Amino acid substitution matrices from protein blocks</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>89</volume>
<fpage>10915</fpage>
<lpage>10919</lpage>
<pub-id pub-id-type="pmid">1438297</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b130"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ochman</surname>
<given-names>H</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Distinguishing the ORFs from the ELFs: Short bacterial genes and the annotation of genomes</article-title>
<source>Trends Genet</source>
<volume>18</volume>
<fpage>335</fpage>
<lpage>337</lpage>
<pub-id pub-id-type="pmid">12127765</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b131"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nekrutenko</surname>
<given-names>A</given-names>
</name>
<name><surname>Makova</surname>
<given-names>KD</given-names>
</name>
<name><surname>Li</surname>
<given-names>WH</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: An empirical and simulation study</article-title>
<source>Genome Res</source>
<volume>12</volume>
<fpage>198</fpage>
<lpage>202</lpage>
<pub-id pub-id-type="pmid">11779845</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b132"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>WH</given-names>
</name>
</person-group>
<year>1997</year>
<source>Molecular Evolution</source>
<publisher-loc>Sunderland (MA)</publisher-loc>
<publisher-name>Sinauer Associates, Inc</publisher-name>
<page-count count="487"></page-count>
</citation>
</ref>
<ref id="pbio-0050016-b133"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Nei</surname>
<given-names>M</given-names>
</name>
<name><surname>Kumar</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2000</year>
<source>Molecular evolution and phylogenetics</source>
<publisher-loc>New York</publisher-loc>
<publisher-name>Oxford University Press</publisher-name>
<page-count count="333"></page-count>
</citation>
</ref>
<ref id="pbio-0050016-b134"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>MUSCLE: Multiple sequence alignment with high accuracy and high throughput</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>1792</fpage>
<lpage>1797</lpage>
<pub-id pub-id-type="pmid">15034147</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b135"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname>
<given-names>Z</given-names>
</name>
</person-group>
<year>1997</year>
<article-title>PAML: A program package for phylogenetic analysis by maximum likelihood</article-title>
<source>Comput Appl Biosci</source>
<volume>13</volume>
<fpage>555</fpage>
<lpage>556</lpage>
<pub-id pub-id-type="pmid">9367129</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b136"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname>
<given-names>Z</given-names>
</name>
<name><surname>Nielsen</surname>
<given-names>R</given-names>
</name>
<name><surname>Goldman</surname>
<given-names>N</given-names>
</name>
<name><surname>Pedersen</surname>
<given-names>AM</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Codon-substitution models for heterogeneous selection pressure at amino acid sites</article-title>
<source>Genetics</source>
<volume>155</volume>
<fpage>431</fpage>
<lpage>449</lpage>
<pub-id pub-id-type="pmid">10790415</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b137"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huynen</surname>
<given-names>MA</given-names>
</name>
<name><surname>van Nimwegen</surname>
<given-names>E</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>The frequency distribution of gene family sizes in complete genomes</article-title>
<source>Mol Biol Evol</source>
<volume>15</volume>
<fpage>583</fpage>
<lpage>589</lpage>
<pub-id pub-id-type="pmid">9580988</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b138"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yanai</surname>
<given-names>I</given-names>
</name>
<name><surname>Camacho</surname>
<given-names>CJ</given-names>
</name>
<name><surname>DeLisi</surname>
<given-names>C</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Predictions of gene family distributions in microbial genomes: Evolution by gene duplication and modification</article-title>
<source>Phys Rev Lett</source>
<volume>85</volume>
<fpage>2641</fpage>
<lpage>2644</lpage>
<pub-id pub-id-type="pmid">10978127</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b139"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qian</surname>
<given-names>J</given-names>
</name>
<name><surname>Luscombe</surname>
<given-names>NM</given-names>
</name>
<name><surname>Gerstein</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model</article-title>
<source>J Mol Biol</source>
<volume>313</volume>
<fpage>673</fpage>
<lpage>681</lpage>
<pub-id pub-id-type="pmid">11697896</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b140"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Unger</surname>
<given-names>R</given-names>
</name>
<name><surname>Uliel</surname>
<given-names>S</given-names>
</name>
<name><surname>Havlin</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Scaling law in sizes of protein sequence families: From super-families to orphan genes</article-title>
<source>Proteins</source>
<volume>51</volume>
<fpage>569</fpage>
<lpage>576</lpage>
<pub-id pub-id-type="pmid">12784216</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b141"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Salgado</surname>
<given-names>H</given-names>
</name>
<name><surname>Gama-Castro</surname>
<given-names>S</given-names>
</name>
<name><surname>Martinez-Antonio</surname>
<given-names>A</given-names>
</name>
<name><surname>Diaz-Peredo</surname>
<given-names>E</given-names>
</name>
<name><surname>Sanchez-Solano</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<year>2004</year>
<article-title>RegulonDB (version 4.0): Transcriptional regulation, operon organization and growth conditions in <named-content content-type="genus-species">Escherichia coli</named-content>
 K-12</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>D303</fpage>
<lpage>D306</lpage>
<pub-id pub-id-type="pmid">14681419</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b142"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thompson</surname>
<given-names>JD</given-names>
</name>
<name><surname>Higgins</surname>
<given-names>DG</given-names>
</name>
<name><surname>Gibson</surname>
<given-names>TJ</given-names>
</name>
</person-group>
<year>1994</year>
<article-title>CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice</article-title>
<source>Nucleic Acids Res</source>
<volume>22</volume>
<fpage>4673</fpage>
<lpage>4680</lpage>
<pub-id pub-id-type="pmid">7984417</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b143"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mailund</surname>
<given-names>T</given-names>
</name>
<name><surname>Pedersen</surname>
<given-names>CN</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>QuickJoin—Fast neighbour-joining tree reconstruction</article-title>
<source>Bioinformatics</source>
<volume>20</volume>
<fpage>3261</fpage>
<lpage>3262</lpage>
<pub-id pub-id-type="pmid">15201185</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b144"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Howe</surname>
<given-names>K</given-names>
</name>
<name><surname>Bateman</surname>
<given-names>A</given-names>
</name>
<name><surname>Durbin</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>QuickTree: Building huge neighbour-joining trees of protein sequences</article-title>
<source>Bioinformatics</source>
<volume>18</volume>
<fpage>1546</fpage>
<lpage>1547</lpage>
<pub-id pub-id-type="pmid">12424131</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b145"><citation citation-type="book"><person-group person-group-type="author"><name><surname>Felsenstein</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2005</year>
<source>PHYLIP (Phylogeny Inference Package) 3.6 edition [computer program]</source>
<publisher-loc>Seattle</publisher-loc>
<publisher-name>Department of Genome Sciences, University of Washington, Seattle</publisher-name>
</citation>
</ref>
<ref id="pbio-0050016-b146"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lord</surname>
<given-names>PW</given-names>
</name>
<name><surname>Stevens</surname>
<given-names>RD</given-names>
</name>
<name><surname>Brass</surname>
<given-names>A</given-names>
</name>
<name><surname>Goble</surname>
<given-names>CA</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation</article-title>
<source>Bioinformatics</source>
<volume>19</volume>
<fpage>1275</fpage>
<lpage>1283</lpage>
<pub-id pub-id-type="pmid">12835272</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b147"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krogh</surname>
<given-names>A</given-names>
</name>
<name><surname>Larsson</surname>
<given-names>B</given-names>
</name>
<name><surname>von Heijne</surname>
<given-names>G</given-names>
</name>
<name><surname>Sonnhammer</surname>
<given-names>EL</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes</article-title>
<source>J Mol Biol</source>
<volume>305</volume>
<fpage>567</fpage>
<lpage>580</lpage>
<pub-id pub-id-type="pmid">11152613</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b148"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Juretic</surname>
<given-names>D</given-names>
</name>
<name><surname>Zoranic</surname>
<given-names>L</given-names>
</name>
<name><surname>Zucic</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Basic charge clusters and predictions of membrane protein topology</article-title>
<source>J Chem Inf Comput Sci</source>
<volume>42</volume>
<fpage>620</fpage>
<lpage>632</lpage>
<pub-id pub-id-type="pmid">12086524</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b149"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Joachimiak</surname>
<given-names>MP</given-names>
</name>
<name><surname>Cohen</surname>
<given-names>FE</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>JEvTrace: Refinement and variations of the evolutionary trace in JAVA</article-title>
<source>Genome Biol</source>
<volume>3</volume>
<fpage>RESEARCH0077</fpage>
<pub-id pub-id-type="pmid">12537566</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b150"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guindon</surname>
<given-names>S</given-names>
</name>
<name><surname>Gascuel</surname>
<given-names>O</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood</article-title>
<source>Syst Biol</source>
<volume>52</volume>
<fpage>696</fpage>
<lpage>704</lpage>
<pub-id pub-id-type="pmid">14530136</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b151"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidt</surname>
<given-names>HA</given-names>
</name>
<name><surname>Strimmer</surname>
<given-names>K</given-names>
</name>
<name><surname>Vingron</surname>
<given-names>M</given-names>
</name>
<name><surname>von Haeseler</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing</article-title>
<source>Bioinformatics</source>
<volume>18</volume>
<fpage>502</fpage>
<lpage>504</lpage>
<pub-id pub-id-type="pmid">11934758</pub-id>
</citation>
</ref>
<ref id="pbio-0050016-b152"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bruno</surname>
<given-names>WJ</given-names>
</name>
<name><surname>Socci</surname>
<given-names>ND</given-names>
</name>
<name><surname>Halpern</surname>
<given-names>AL</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction</article-title>
<source>Mol Biol Evol</source>
<volume>17</volume>
<fpage>189</fpage>
<lpage>197</lpage>
<pub-id pub-id-type="pmid">10666718</pub-id>
</citation>
</ref>
</ref-list>
<sec sec-type="display-objects"><title>Figures and Tables</title>
<fig id="oceaniclogo" position="float"><graphic xlink:href="oceaniclogo"></graphic>
</fig>
<fig id="pbio-0050016-g001" position="float"><label>Figure 1</label>
<caption><title>Proportion of Sequences for Each Kingdom</title>
<p>(A) The combined set of NCBI-nr, PG, TGI-EST, and ENS has 3,167,979 sequences. The eukaryotes account for the largest portion, which is more than twice the bacterial fraction.</p>
<p>(B) Predicted kingdom proportion of sequences in GOS. Out of the 5,654,638 GOS sequences, 5,058,757 are assigned kingdoms using a BLAST-based scheme. The bacterial kingdom forms by far the largest fraction in the GOS set.</p>
</caption>
<graphic xlink:href="pbio.0050016.g001"></graphic>
</fig>
<fig id="pbio-0050016-g002" position="float"><label>Figure 2</label>
<caption><title>Rate of Discovery of Clusters as (Nonredundant) Sequences Are Added</title>
<p>The <italic>x</italic>
-axis denotes the number of sequences (in millions) and the <italic>y</italic>
-axis denotes the number of clusters (in thousands). Seven datasets with increasing numbers of (nonredundant) sequences are chosen as described in the text. The blue curve shows the number of core sets of size ≥3 for the seven datasets. Curves for core set sizes ≥5, ≥10, and ≥20 are also shown. Linear regression gives slopes 0.027 (<italic>R</italic>
<sup>2</sup>
 = 0.999), 0.011 (<italic>R</italic>
<sup>2</sup>
 = 0.999), 0.0053 (<italic>R</italic>
<sup>2</sup>
 = 0.999), and 0.0024 (<italic>R</italic>
<sup>2</sup>
 = 0.996) for size ≥3, size ≥5, size ≥10, and size ≥20, respectively.</p>
</caption>
<graphic xlink:href="pbio.0050016.g002"></graphic>
</fig>
<fig id="pbio-0050016-g003" position="float"><label>Figure 3</label>
<caption><title>Venn Diagram Showing Breakdown of the 17,067 Medium and Large Clusters by Three Categories—GOS, Known Prokaryotic, and Known Nonprokaryotic</title>
</caption>
<graphic xlink:href="pbio.0050016.g003"></graphic>
</fig>
<fig id="pbio-0050016-g004" position="float"><label>Figure 4</label>
<caption><title>Enrichment in the GOS-Only Set of Clusters for Viral Neighbors</title>
<p>Cluster sets from left to right are: I, GOS-only clusters with detectable BLAST, HMM, or profile-profile homology (Group I); II, GOS-only clusters with no detectable homology (Group II); I-S, a sample from all clusters chosen to have the same size distribution as Group I; II-S, a sample from all clusters chosen to have the same size distribution as Group II; I-V, a subset of clusters in Group I containing sequences collected from the viral size fraction; II-V, a subset of clusters in Group II from the viral size fraction; and all clusters. Notice that although predominantly bacterial, GOS-only clusters are assigned as viral based on their neighbors more often than the size-matched samples and the set of all clusters.</p>
</caption>
<graphic xlink:href="pbio.0050016.g004"></graphic>
</fig>
<fig id="pbio-0050016-g005" position="float"><label>Figure 5</label>
<caption><title>Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size</title>
<p>The public-100 sequences are annotated using the NCBI taxonomy and the source public database annotations. GOS-100 sequences were given kingdom weights as described in <xref ref-type="sec" rid="s3">Materials and Methods</xref>
. For each kingdom, the fraction of sequences with ≥1 Pfam match is shown, with the ten largest Pfam families shown as discrete sections whose size is proportional to the number of matches between that family and GOS-100 or public-100 sequences. Pfam families that are smaller than the ten largest are binned together in each column's bottom section. Pfam covers public-100 better than GOS-100 in all kingdoms, with the greatest difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences match a Pfam domain, while only 27.5% of GOS-100 viral sequences do.</p>
</caption>
<graphic xlink:href="pbio.0050016.g005"></graphic>
</fig>
<fig id="pbio-0050016-g006" position="float"><label>Figure 6</label>
<caption><title>Maximum Likelihood Phylogeny for the IDO Family</title>
<p>The phylogeny is based on an alignment of 93 sequences from GOS-100 and 51 sequences from public-100 and NCBI-nr from March 2006 that matched the IDO Pfam model and satisfied multiple alignment quality criteria. The IDO family is eukaryotic specific in public-100. The phylogeny shows a clade with all the GOS sequences, predicted to be bacterial (navy blue), eukaryotic (yellow), or unknown (gray), along with two sequences from the marine bacteria <named-content content-type="genus-species">Erythrobacter litoralis</named-content>
 and <italic>Nitrosococcus oceani</italic>
 (lime green) submitted to the sequence database after February 2005, and a public-only clade of only eukaryotic sequences (orange).</p>
</caption>
<graphic xlink:href="pbio.0050016.g006"></graphic>
</fig>
<fig id="pbio-0050016-g007" position="float"><label>Figure 7</label>
<caption><title>Phylogenies Illustrating the Diversity Added by GOS Data to Known Families That We Examined</title>
<p>Kingdom assignments of the sequences are indicated by color: yellow, GOS-eukaryotic; navy blue, GOS-bacterial/archaeal; aqua, GOS-viral; orange, NCBI-nr–eukaryotic; lime green, NCBI-nr–bacterial/archaeal; pink, NCBI-nr–viral; gray, unclassified.</p>
<p>(A) Phylogeny of UVDE homologs.</p>
<p>(B) Phylogeny of PP2C-like sequences.</p>
<p>(C) Phylogeny of type II GS gene family. In addition to the large amount of diversity of bacterial type II GS in the GOS data, a large group of GOS viral sequences and eukaryotic GS co-occur at the top of the tree with the eukaryotic virus <named-content content-type="genus-species">Acanthamoeba polyphaga</named-content>
 mimivirus (shown in pink). The red stars indicate the locations of eight type II GS sequences found in the type I–type II GS gene pairs. They are located in different branches of the phylogenetic tree. The rest of the type II GS sequences were filtered out by the 98% identity cutoff.</p>
<p>(D) Phylogeny of the homologs of RuBisCO large subunit. A large portion of the RuBisCO sequences from the GOS data forms new branches that are distinct from the previously known RuBisCO sequences in the NCBI-nr database.</p>
</caption>
<graphic xlink:href="pbio.0050016.g007"></graphic>
</fig>
<fig id="pbio-0050016-g008" position="float"><label>Figure 8</label>
<caption><title>Distribution of Average HMM Score Difference between GOS and Public (NCBI-nr, PG, TGI-EST, and ENS)</title>
<p>Only matches to the full length of an HMM are considered, and only HMMs that have at least 100 matches to each of GOS and public databases are considered. This results in 1,686 HMMs whose average scores to GOS and public databases are considered. The mean of the distribution is −50, showing that GOS sequences tend to score lower than sequences in public, thereby reflecting diversity compared to sequences in public.</p>
</caption>
<graphic xlink:href="pbio.0050016.g008"></graphic>
</fig>
<fig id="pbio-0050016-g009" position="float"><label>Figure 9</label>
<caption><title>Pie Chart of ORFans That Had GOS Matches</title>
<p>ORFans are grouped by organism (left), number of their GOS matches (middle), and the lowest <italic>E-</italic>
value to their GOS matches in negative logarithm form (right). For both middle and right charts, inner and outer circles represent noneukaryotic and eukaryotic ORFans, respectively. From the middle chart it is seen that 626 (= 404 + 180 + 21 + 21) ORFans form significant protein families with ≥20 GOS matches.</p>
</caption>
<graphic xlink:href="pbio.0050016.g009"></graphic>
</fig>
<fig id="pbio-0050016-g010" position="float"><label>Figure 10</label>
<caption><title>Structure and GOS Homologs of Hypothetical Protein AF1548</title>
<p>Yellow bars represent β-strands. Highlighted are predicted catalytic residues: 38D, 51E, and 53K.</p>
</caption>
<graphic xlink:href="pbio.0050016.g010"></graphic>
</fig>
<fig id="pbio-0050016-g011" position="float"><label>Figure 11</label>
<caption><title>Rate of Cluster Discovery for Mammals Compared to That for Microbes</title>
<p>The <italic>x</italic>
-axis denotes the number of sequences (in thousands), and the <italic>y</italic>
-axis denotes the number of clusters (in thousands). Five mammalian genomes are considered for the “Mammalian” dataset, and the plot shows the number of clusters that are hit when each additional genome is added. For the “Mammalian Random” dataset, the order of the sequences from the “Mammalian” dataset is randomized. For the NCBI-nr prokaryotic and GOS datasets, random subsets of size similar to that of the mammalian set are considered.</p>
</caption>
<graphic xlink:href="pbio.0050016.g011"></graphic>
</fig>
<fig id="pbio-0050016-g012" position="float"><label>Figure 12</label>
<caption><title>Log–Log Plots of Cluster Size Distributions</title>
<p>The <italic>x</italic>
-axis is the logarithm of the cluster size <italic>X</italic>
 and the <italic>y</italic>
-axis is the logarithm of the number of clusters of size at least <italic>X;</italic>
 logarithms are base 10.</p>
<p>(A) Plot comparing the sizes of clusters produced by our clustering approach (red) to those of clusters produced by Pfams (green). The curves track each other quite well, with both of them having an inflection point around cluster size 2,500 (approximately 3.4 on the <italic>x</italic>
-axis). Each sequence is assigned to the highest scoring Pfam that it matches. Two sequences that are assigned to the same Pfam can nevertheless be assigned to different clusters by the full-sequence–based clustering approach if they differ in the remaining portion. This is especially true for commonly occurring domains that are present in different multidomain proteins. Thus, there tends to be a larger number of big clusters in the Pfam approach as compared to the full-sequence–based approach. Hence, the green curve is above the red curve at the higher sizes.</p>
<p>(B) Plot of the cluster size distributions for core sets (green) and for final clusters (red). Both curves have an inflection point around cluster size 2,500 (approximately 3.4 on the <italic>x</italic>
-axis). Note that these plots give the cumulative distribution function (cdf), while the power law exponents reported in the text are for the number of clusters of size <italic>X</italic>
 (i.e., the probability density function [pdf]). The relationship between these exponents is β<sub>pdf</sub>
 = 1 + β<sub>cdf</sub>
.</p>
</caption>
<graphic xlink:href="pbio.0050016.g012"></graphic>
</fig>
<fig id="pbio-0050016-g013" position="float"><label>Figure 13</label>
<caption><title>Log–Log Plot of Slopes <italic>m</italic>
(<italic>d</italic>
) of Linear Regression Fit to the Rate of Growth in <xref ref-type="fig" rid="pbio-0050016-g002">Figure 2</xref>
 for Different Values of Cluster Size <italic>d</italic>
</title>
<p>According to the equation derived in the text, <italic>m</italic>
(<italic>d</italic>
) = <italic>md</italic><sup>1−β</sup>
 for some constant <italic>m</italic>
. The best linear fit to log [<italic>m</italic>
(<italic>d</italic>
)] gives a line with slope −0.91 (<italic>R</italic>
<sup>2</sup>
 = 0.98) that is close to the predicted value 1 − β = −0.99.</p>
</caption>
<graphic xlink:href="pbio.0050016.g013"></graphic>
</fig>
<fig id="pbio-0050016-g014" position="float"><label>Figure 14</label>
<caption><title>Receiver Operating Characteristic Curve Used to Evaluate Various Methods of Scoring Pairs of Clusters for Functional Similarity</title>
<p>Pairs of clusters with ≥1 example of neighboring ORFs and assigned GO terms were divided into a set of functionally related (true positive) and functionally unrelated (true negative) cluster pairs based on the similarity of their GO terms. The scoring methods evaluated are described in the text.</p>
</caption>
<graphic xlink:href="pbio.0050016.g014"></graphic>
</fig>
<fig id="pbio-0050016-g015" position="float"><label>Figure 15</label>
<caption><title>Novel GOS-Only Clusters Are More Interconnected Than a Size-Matched Sample of Clusters</title>
<p>Red line, novel clusters; green line, size-matched sample; blue line (right axis), log<sub>2</sub>
 ratio of the fraction of novel clusters recovered to the fraction of sample clusters recovered.</p>
</caption>
<graphic xlink:href="pbio.0050016.g015"></graphic>
</fig>
<fig id="pbio-0050016-g016" position="float"><label>Figure 16</label>
<caption><title>GOS-Only Clusters Are Enriched for Sequences of Viral Origin Independently of the Kingdom Assignment Method Employed</title>
<p>For each panel, clusters are as in <xref ref-type="fig" rid="pbio-0050016-g004">Figure 4</xref>
. For (A–C), a kingdom is assigned to each neighboring ORF within each cluster set; the percentage of all neighboring ORFs with a given kingdom assignment is plotted. For (D–F), a kingdom is assigned to each cluster if more than 50% of all that cluster's neighbors with a kingdom assignment share the same assignment; the percentage of clusters in each set with a given assignment is plotted. In (A) and (D), a kingdom is assigned to a neighboring ORF by a majority vote of the top four BLAST matches to a protein in NCBI-nr (<xref ref-type="sec" rid="s3">Materials and Methods</xref>
). In (B) and (E), a kingdom is assigned if all eight highest-scoring BLAST matches agree in kingdom. In (C) and (F), all ORFs on a scaffold are assigned the same kingdom by voting among all ORFs with BLAST matches to NCBI-nr on that scaffold (<xref ref-type="sec" rid="s3">Materials and Methods</xref>
). In all graphs, only clusters with at least one assignable neighbor are considered. When compared to the size-matched controls, in all cases the GOS-only clusters show enrichment for viral sequences.</p>
</caption>
<graphic xlink:href="pbio.0050016.g016"></graphic>
</fig>
<fig id="pbio-0050016-g017" position="float"><label>Figure 17</label>
<caption><title>Content of Protease Types in NCBI-nr and GOS, and Kingdom Distribution of All Proteases</title>
<p>Due to the highly redundant nature of some NCBI-nr protease groups, nonredundant sets for both NCBI-nr and GOS are computed; these nonredundant sets are referred to as NCBI-nr60 and GOS60.</p>
</caption>
<graphic xlink:href="pbio.0050016.g017"></graphic>
</fig>
<fig id="pbio-0050016-g018" position="float"><label>Figure 18</label>
<caption><title>Content of Bacterial Protease Clans</title>
</caption>
<graphic xlink:href="pbio.0050016.g018"></graphic>
</fig>
<table-wrap id="pbio-0050016-t001" content-type="2col" position="float"><label>Table 1</label>
<caption><p>The Complete Dataset Consisted of Sequences from NCBI-nr, ENS, TGI-EST, PG, and GOS, for a Total of 28,610,944 Sequences</p>
</caption>
<graphic xlink:href="pbio.0050016.t001"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t002" content-type="2col" position="float"><label>Table 2</label>
<caption><p>Clustering and HMM Profiling Results Showing the Number of Predicted Proteins (Including Both Redundant and Nonredundant Sequences) in Each Dataset</p>
</caption>
<graphic xlink:href="pbio.0050016.t002"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t003" content-type="2col" position="float"><label>Table 3</label>
<caption><p>Cluster Size Distribution and the Distribution of Sequences in These Clusters</p>
</caption>
<graphic xlink:href="pbio.0050016.t003"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t004" content-type="2col" position="float"><label>Table 4</label>
<caption><p>List of the Top 25 Clusters from the Clustering Process</p>
</caption>
<graphic xlink:href="pbio.0050016.t004"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t005" content-type="2col" position="float"><label>Table 5</label>
<caption><p>Neighbor-Based Inference of Function for Novel Clusters of GOS Sequences</p>
</caption>
<graphic xlink:href="pbio.0050016.t005"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t006" content-type="2col" position="float"><label>Table 6</label>
<caption><p>Functions Skewed in Domain Representation between PG and GOS</p>
</caption>
<graphic xlink:href="pbio.0050016.t006"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t007" content-type="2col" position="float"><label>Table 7</label>
<caption><p>Top Pfam Families Represented More Highly or Less Highly in GOS-100 than in Public-100</p>
</caption>
<graphic xlink:href="pbio.0050016.t007"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t008" content-type="1col" position="float"><label>Table 8</label>
<caption><p>New Multi-Kingdom Pfams</p>
</caption>
<graphic xlink:href="pbio.0050016.t008"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t009" content-type="2col" position="float"><label>Table 9</label>
<caption><p>The 30 Largest Structural Genomics Target Families Added to the Pfam5000 Based on Inclusion of GOS Sequences</p>
</caption>
<graphic xlink:href="pbio.0050016.t009"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t010" content-type="2col" position="float"><label>Table 10</label>
<caption><p>Clustering of Sequences in Families That Are Explored in This and Companion Papers</p>
</caption>
<graphic xlink:href="pbio.0050016.t010"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t011" content-type="2col" position="float"><label>Table 11</label>
<caption><p>Top 20 Organisms with Most ORFans Matched by GOS</p>
</caption>
<graphic xlink:href="pbio.0050016.t011"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t012" content-type="1col" position="float"><label>Table 12</label>
<caption><p>The Number of Sequences in NCBI-nr, PG ORFs, TGI-EST ORFs, ENS, and GOS ORFs prior to and after the Redundancy Removal Step of Our Clustering</p>
</caption>
<graphic xlink:href="pbio.0050016.t012"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t013" content-type="1col" position="float"><label>Table 13</label>
<caption><p>BLAST-Based Classification Rate per Kingdom</p>
</caption>
<graphic xlink:href="pbio.0050016.t013"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t014" content-type="1col" position="float"><label>Table 14</label>
<caption><p>The Values for <italic>C<sub>≥d</sub>
</italic>
(<italic>n</italic>
), the Number of Clusters of Size <italic>≥d</italic>
, as a Function of the Power Law Exponent β and Constant α</p>
</caption>
<graphic xlink:href="pbio.0050016.t014"></graphic>
</table-wrap>
<table-wrap id="pbio-0050016-t015" content-type="1col" position="float"><label>Table 15</label>
<caption><p>Clustering Information for Ensembl Sequences for <italic>H. sapiens, M. musculus, R. norvegicus, C. familiaris</italic>
, and <named-content content-type="genus-species">P. troglodytes</named-content>
</p>
</caption>
<graphic xlink:href="pbio.0050016.t015"></graphic>
</table-wrap>
</sec>
<fn-group><fn id="n103" fn-type="other"><p>This article is part of the Global Ocean Sampling collection in <italic>PLoS Biology.</italic>
 The full collection is available online at <ext-link ext-link-type="uri" xlink:href="http://collections.plos.org/plosbiology/gos-2007.php">http://collections.plos.org/plosbiology/gos-2007.php</ext-link>
.</p>
</fn>
<fn id="ack1" fn-type="con"><p><bold>Author contributions.</bold>
 SY contributed to the design and implementation of the clustering process, and the subsequent analyses of the clusters; he also contributed to and coordinated all of the analyses in the paper, and wrote a large portion of the paper. GS contributed to the design and analysis of the clustering process, contributed ideas, analysis, and also wrote parts of the paper. DBR identified ORFs from the assemblies, performed the all-against-all BLAST searches, contributed to GOS kingdom assignment, and contributed analysis tools and ideas. ALH performed the assembly of GOS sequences, and contributed analysis tools and ideas. SW contributed to the analysis of viral sequences. KR contributed to project planning and paper writing. JAE performed the analysis of UV damage repair enzymes, and also contributed to paper writing. KBH, RF, and RLS contributed to project planning. GM performed the profile HMM searches, carried out the domain analysis, and contributed to paper writing. WL and AG carried out the ORFan analysis and contributed to paper writing. LJ contributed to the profile-profile search process. PC and AG carried out the analysis of proteases and contributed to paper writing. CSM, HL, and DE carried out the analysis of novel clusters, the analysis of metabolic enzymes and contributed to paper writing. YZ contributed to the profile HMM searches and domain analysis. STM, MPJ, CvB, DAS, and SEB carried out the analysis of Pfam domain distributions in GOS and current proteins, analysis of IDO, contributed to GOS kingdom assignment, and also contributed to paper writing. DAS and SEB also contributed to the Ka/Ks test. JMC and SEB carried out the analysis on the implications for structural genomics and contributed to paper writing. SL, KN, SST, and JED carried out the phosphatase analysis and contributed to paper writing. SST and JED also contributed to project planning. 
BJR and VB contributed to the analysis of cluster size distribution, family discovery rate, and contributed to paper writing. MF contributed to paper writing, project planning, and ideas for analysis. JCV conceived and coordinated the project, and supplied ideas.</p>
</fn>
<fn id="ack2" fn-type="financial-disclosure"><p><bold>Funding.</bold>
 The authors acknowledge the Department of Energy Genomics: GTL Program, Office of Science (DE-FG02-02ER63453), the Gordon and Betty Moore Foundation, the Discovery Channel and the J. Craig Venter Science Foundation for funding to undertake this study. GM acknowledges funding from the Razavi-Newman Center for Bioinformatics and was also supported by National Cancer Institute grant P30 CA014195. PC was partially supported by a Center for Proteolytic Pathways (CPP)–National Institutes of Health (NIH) grant 5U54 RR020843–02. CSM, HL, and DE acknowledge the support of DOE Biological and Environmental Research (BER). SL and JED were supported by research grants from NIH. BJR was supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. Support for the Brenner lab work was provided by NIH K22 HG00056 and an IBM Shared University Research grant. STM was supported by NIH Genomics Training Grant 5T32 HG00047. MPJ was supported by NIH P20 GM068136 and NIH K22 HG00056. CvB was supported in part by the Haas Scholars Program. DAS was supported by a Howard Hughes Medical Institute Predoctoral Fellowship. JMC was supported by NIH grant R01 GM073109, and by the US Department of Energy Genomics: GTL program through contract DE-AC02-05CH11231.</p>
</fn>
<fn id="ack3" fn-type="conflict"><p><bold>Competing interests.</bold>
 The authors have declared that no competing interests exist.</p>
</fn>
</fn-group>
</back>
</pmc>
</record>
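
The Figure 12 caption states that the power law exponents for "size exactly X" and "size at least X" differ by one (β_pdf = 1 + β_cdf). The following is a minimal Python sketch that checks this numerically on an idealized power law rather than on the paper's cluster data. The exponent 1.99 (chosen so that 1 − β = −0.99, the predicted value quoted in the Figure 13 caption), the size cutoff, the fitting range, and all function and variable names are assumptions made for illustration.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


BETA_PDF = 1.99      # assumed exponent, so that 1 - beta = -0.99
MAX_SIZE = 200_000   # assumed largest cluster size in the toy model

# Idealized (unnormalized) number of clusters of exactly size x.
exact = [x ** -BETA_PDF for x in range(1, MAX_SIZE + 1)]

# Number of clusters of size at least x, built as a running suffix sum.
at_least = [0.0] * MAX_SIZE
running = 0.0
for i in range(MAX_SIZE - 1, -1, -1):
    running += exact[i]
    at_least[i] = running

# Fit both curves over sizes 10..1000, far from the truncation point.
sizes = list(range(10, 1001))
slope_pdf = loglog_slope(sizes, [exact[x - 1] for x in sizes])
slope_cdf = loglog_slope(sizes, [at_least[x - 1] for x in sizes])

print(f"slope for size == X: {slope_pdf:.3f}")
print(f"slope for size >= X: {slope_cdf:.3f}")
print(f"difference:          {slope_cdf - slope_pdf:.3f}")
```

Under these assumed values the two fitted slopes should differ by close to 1, as the caption states. Any small residual comes from the discrete cluster sizes and the finite cutoff, not from the relationship itself.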

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000592 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000592 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:1821046
   |texte=   The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:17355171" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CyberinfraV1

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024
