Cyberinfrastructure exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies

Internal identifier: 000587 (Pmc/Corpus); previous: 000586; next: 000588

Authors: H. James Tripp; Ian Hewson; Sam Boyarsky; Joshua M. Stuart; Jonathan P. Zehr

Source: Nucleic Acids Research, vol. 39, no. 20 (2011), pp. 8792-8802

RBID : PMC:3203614

Abstract

In the course of analyzing 9 522 746 pyrosequencing reads from 23 stations in the Southwestern Pacific and equatorial Atlantic oceans, it came to our attention that misannotations of rRNA as proteins are now so widespread that false positive matching of rRNA pyrosequencing reads to the National Center for Biotechnology Information (NCBI) non-redundant protein database approaches 90%. One conserved portion of 23S rRNA was consistently misannotated often enough to prompt curators at Pfam to create a spurious protein family. Detailed examination of the annotation history of each seed sequence in the spurious Pfam protein family (PF10695, ‘Cw-hydrolase’) uncovered issues in the standard operating procedures and quality assurance programs of major sequencing centers, and other issues relating to the curation practices of those managing public databases such as GenBank and SwissProt. We offer recommendations for all these issues, and recommend as well that workers in the field of metatranscriptomics take extra care to avoid including false positive matches in their datasets.
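
The false positive rate described in the abstract can be illustrated with a minimal sketch. It assumes that each read has already been flagged as rRNA or not (for example by comparison against an rRNA reference) and that protein-database hits are available as a mapping from read ID to subject. The read IDs, hit names, and function names below are invented toy values for illustration only; they are not the authors' pipeline or data.

```python
def false_positive_fraction(rrna_read_ids, protein_hit_ids):
    """Fraction of rRNA-flagged reads that also have a protein-database hit."""
    rrna = set(rrna_read_ids)
    if not rrna:
        return 0.0
    return len(rrna & set(protein_hit_ids)) / len(rrna)


def drop_rrna_hits(protein_hits, rrna_read_ids):
    """Keep only protein hits whose read was NOT flagged as rRNA."""
    rrna = set(rrna_read_ids)
    return {read: subject for read, subject in protein_hits.items()
            if read not in rrna}


# Toy data (hypothetical): ten reads flagged as rRNA, nine of which
# nevertheless match a "protein" entry, plus two genuine mRNA reads.
rrna_reads = {f"r{i}" for i in range(1, 11)}
protein_hits = {f"r{i}": "hypothetical protein" for i in range(1, 10)}
protein_hits.update({"m1": "nifH", "m2": "psbA"})

fraction = false_positive_fraction(rrna_reads, protein_hits.keys())
clean_hits = drop_rrna_hits(protein_hits, rrna_reads)
print(f"rRNA reads with a protein match: {fraction:.0%}")
print(f"protein hits kept after rRNA filtering: {sorted(clean_hits)}")
```

Filtering on the read's rRNA status before counting protein matches is one simple way to follow the abstract's advice to keep false positive matches out of a dataset; other strategies are possible.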


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3203614
DOI: 10.1093/nar/gkr576
PubMed: 21771858
PubMed Central: 3203614

Links to Exploration step

PMC:3203614

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies</title>
<author>
<name sortKey="Tripp, H James" sort="Tripp, H James" uniqKey="Tripp H" first="H. James" last="Tripp">H. James Tripp</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hewson, Ian" sort="Hewson, Ian" uniqKey="Hewson I" first="Ian" last="Hewson">Ian Hewson</name>
<affiliation>
<nlm:aff id="AFF1">Department of Microbiology, Cornell University, Wing Hall 403, Ithaca, NY 14853, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Boyarsky, Sam" sort="Boyarsky, Sam" uniqKey="Boyarsky S" first="Sam" last="Boyarsky">Sam Boyarsky</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stuart, Joshua M" sort="Stuart, Joshua M" uniqKey="Stuart J" first="Joshua M." last="Stuart">Joshua M. Stuart</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zehr, Jonathan P" sort="Zehr, Jonathan P" uniqKey="Zehr J" first="Jonathan P." last="Zehr">Jonathan P. Zehr</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21771858</idno>
<idno type="pmc">3203614</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3203614</idno>
<idno type="RBID">PMC:3203614</idno>
<idno type="doi">10.1093/nar/gkr576</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000587</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies</title>
<author>
<name sortKey="Tripp, H James" sort="Tripp, H James" uniqKey="Tripp H" first="H. James" last="Tripp">H. James Tripp</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hewson, Ian" sort="Hewson, Ian" uniqKey="Hewson I" first="Ian" last="Hewson">Ian Hewson</name>
<affiliation>
<nlm:aff id="AFF1">Department of Microbiology, Cornell University, Wing Hall 403, Ithaca, NY 14853, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Boyarsky, Sam" sort="Boyarsky, Sam" uniqKey="Boyarsky S" first="Sam" last="Boyarsky">Sam Boyarsky</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Stuart, Joshua M" sort="Stuart, Joshua M" uniqKey="Stuart J" first="Joshua M." last="Stuart">Joshua M. Stuart</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zehr, Jonathan P" sort="Zehr, Jonathan P" uniqKey="Zehr J" first="Jonathan P." last="Zehr">Jonathan P. Zehr</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>In the course of analyzing 9 522 746 pyrosequencing reads from 23 stations in the Southwestern Pacific and equatorial Atlantic oceans, it came to our attention that misannotations of rRNA as proteins are now so widespread that false positive matching of rRNA pyrosequencing reads to the National Center for Biotechnology Information (NCBI) non-redundant protein database approaches 90%. One conserved portion of 23S rRNA was consistently misannotated often enough to prompt curators at Pfam to create a spurious protein family. Detailed examination of the annotation history of each seed sequence in the spurious Pfam protein family (PF10695, ‘Cw-hydrolase’) uncovered issues in the standard operating procedures and quality assurance programs of major sequencing centers, and other issues relating to the curation practices of those managing public databases such as GenBank and SwissProt. We offer recommendations for all these issues, and recommend as well that workers in the field of metatranscriptomics take extra care to avoid including false positive matches in their datasets.</p>
</div>
</front>
<back>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21771858</article-id>
<article-id pub-id-type="pmc">3203614</article-id>
<article-id pub-id-type="doi">10.1093/nar/gkr576</article-id>
<article-id pub-id-type="publisher-id">gkr576</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genomics</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Tripp</surname>
<given-names>H. James</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hewson</surname>
<given-names>Ian</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Boyarsky</surname>
<given-names>Sam</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stuart</surname>
<given-names>Joshua M.</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zehr</surname>
<given-names>Jonathan P.</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
</contrib-group>
<aff id="AFF1">
<sup>1</sup>
Department of Ocean Sciences, University of California, Santa Cruz, CA 95064, USA and
<sup>2</sup>
Department of Microbiology, Cornell University, Wing Hall 403, Ithaca, NY 14853, USA</aff>
<author-notes>
<corresp id="COR1">*To whom correspondence should be addressed. Tel:
<phone>831 459 4009</phone>
; Fax:
<fax>831 459 4882</fax>
; Email:
<email>zehrj@ucsc.edu</email>
</corresp>
</author-notes>
<pmc-comment>For NAR both ppub and collection dates generated for PMC processing 1/27/05 beck</pmc-comment>
<pub-date pub-type="collection">
<month>11</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="ppub">
<month>11</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>19</day>
<month>7</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>19</day>
<month>7</month>
<year>2011</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>39</volume>
<issue>20</issue>
<fpage>8792</fpage>
<lpage>8802</lpage>
<history>
<date date-type="received">
<day>24</day>
<month>3</month>
<year>2011</year>
</date>
<date date-type="rev-recd">
<day>24</day>
<month>6</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>6</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2011. Published by Oxford University Press.</copyright-statement>
<copyright-year>2011</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>In the course of analyzing 9 522 746 pyrosequencing reads from 23 stations in the Southwestern Pacific and equatorial Atlantic oceans, it came to our attention that misannotations of rRNA as proteins are now so widespread that false positive matching of rRNA pyrosequencing reads to the National Center for Biotechnology Information (NCBI) non-redundant protein database approaches 90%. One conserved portion of 23S rRNA was consistently misannotated often enough to prompt curators at Pfam to create a spurious protein family. Detailed examination of the annotation history of each seed sequence in the spurious Pfam protein family (PF10695, ‘Cw-hydrolase’) uncovered issues in the standard operating procedures and quality assurance programs of major sequencing centers, and other issues relating to the curation practices of those managing public databases such as GenBank and SwissProt. We offer recommendations for all these issues, and recommend as well that workers in the field of metatranscriptomics take extra care to avoid including false positive matches in their datasets.</p>
</abstract>
<counts>
<page-count count="11"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec>
<title>INTRODUCTION</title>
<p>Ribosomes are the site of peptide bond formation in all living cells, from bacteria to humans (
<xref ref-type="bibr" rid="B1">1</xref>
). They are composed in part of highly conserved RNA sequences (
<xref ref-type="bibr" rid="B2">2</xref>
) usually coded on DNA in operons of three subunits (16S, 23S and 5S) in Bacteria and Archaea (
<xref ref-type="bibr" rid="B3">3</xref>
,
<xref ref-type="bibr" rid="B4">4</xref>
) and in tandem repeats of longer operons that ultimately mature to four subunits (18S, 25/28S, 5.8S and 5S) in Eukaryotes (
<xref ref-type="bibr" rid="B5">5</xref>
,
<xref ref-type="bibr" rid="B6">6</xref>
). The complete primary nucleotide sequences of representative rRNA subunits in the seven duplicated rRNA operons of
<italic>Escherichia coli</italic>
were published between 1967 and 1978 (
<xref ref-type="bibr" rid="B7 B8 B9 B10">7–10</xref>
). The rRNA nucleotide sequences for
<italic>Saccharomyces cerevisiae,</italic>
which occur in ∼140 tandem repeats, were published between 1972 and 1981 (
<xref ref-type="bibr" rid="B11 B12 B13 B14">11–14</xref>
).</p>
<p>While artificial overexpression of a pentapeptide sequence adjacent to a Shine–Dalgarno motif within
<italic>E. coli</italic>
23S rRNA was found to impart drug resistance to erythromycin (
<xref ref-type="bibr" rid="B15">15</xref>
), rRNA operons in Bacteria and Archaea are not known to contain naturally expressed protein coding regions that also code for rRNA. Also, while antisense transcription was recently reported for Bacterial and Archaeal proteins, that study did not report antisense transcription from Bacterial and Archaeal rRNA (
<xref ref-type="bibr" rid="B16">16</xref>
). To be sure, insertion elements can be found in rRNA operons of Bacteria and Archaea, but not sequences that code for rRNA and protein at the same time. Therefore, annotations of Bacteria and Archaea proteins embedded in rRNA operons and overlapping with rRNA coding regions within those operons have been rightly presumed to be misannotations (
<xref ref-type="bibr" rid="B17">17</xref>
) and should continue to be, until hard evidence to the contrary emerges. While these misannotations continue to exist, they have the potential to generate false positive matches of translated environmental rRNA sequences to proteins. To our knowledge, the potential for false positives in metatranscriptomic studies due to misannotations of rRNA operons has not been reported prior to this study.</p>
<p>Unlike Bacterial and Archaeal rRNA operons, the yeast rRNA operon has indeed been shown to contain an embedded protein coding domain sequence (CDS) called Tar1p that overlaps the 5′–end of the DNA sequence coding for the 25S rRNA subunit (
<xref ref-type="bibr" rid="B18">18</xref>
). Another substantive difference between rRNA operons in Bacteria, Archaea and Eukaryotes is that Eukaryotes are also known to contain rRNA sequences that have moved to other parts of the genome including expressed coding regions, putatively with regulatory functions (
<xref ref-type="bibr" rid="B19">19</xref>
). Other Eukaryotic proteins of unknown function having rRNA homology have been reported (
<xref ref-type="bibr" rid="B20 B21 B22">20–22</xref>
). All these Eukaryotic proteins with real rRNA homology are another source of potential false positives in metatranscriptomic studies, since translations of conserved rRNA sequences from other Eukaryotes will match to these protein sequences.</p>
<p>We observed both kinds of false positives in a metatranscriptome of 9 522 746 pyrosequencing reads from 23 stations in the Southwestern Pacific and equatorial Atlantic oceans. When we discovered that the misannotations of Bacterial and Archaeal rRNA sequences were so widespread that a spurious Pfam (
<xref ref-type="bibr" rid="B23">23</xref>
) protein family had been created, we paused in our ecological analysis to assess the extent of these misannotations and to make recommendations on how to address them.</p>
</sec>
<sec sec-type="materials|methods">
<title>MATERIALS AND METHODS</title>
<sec>
<title>Analysis of known
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 expressed intergenic regions</title>
<p>Fasta sequences of the 11 expressed intergenic regions (eIGRs) of
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 (
<xref ref-type="bibr" rid="B24">24</xref>
) were compared to all RNA reads [CAMERA (
<xref ref-type="bibr" rid="B25">25</xref>
) project names CAM_PROJ_PacificOcean and CAM_PROJ_AmazonRiverPlume] using blastn and a bit score cut-off of 40. The blast results were parsed and loaded into a MySQL database containing sample metadata. Blast results and metadata were joined into Structured Query Language (SQL) logical views for analysis of eIGRs by sample. The data in the SQL logical views were summarized and visualized using Microsoft Access and Excel.</p>
</sec>
<sec>
<title>Analysis of gene contexts for PF10695 seed sequences</title>
<p>We hypothesized that each seed sequence was embedded in an rRNA operon, overlapping the 3′-end of the 23S rRNA sequence on the opposite strand, and we attempted to extract the full rRNA operon containing each Pfam seed sequence. Because the 3′-end of the 23S rRNA sequence is no more than 500 bp from the 3′-end of the entire rRNA operon in most Bacteria and Archaea, we took the position 500 bp upstream of the seed sequence (recall that the seed sequence is hypothesized to lie on the opposite strand from the rRNA sequence) as the likely 3′-end of the operon. Because the 5′-end of the rRNA operon is usually no more than 5000 bp upstream from the 3′-end of the 23S rRNA sequence, we took the position 5000 bp downstream of the seed sequence as the likely 5′-end of the operon. For three seed sequences (GI 145845866, 47093546, 121729912), we could not extract the entire region of interest because the contig on which the seed sequence was found ended prematurely. For two seed sequences (GI 149912432 and 90419149), 6000 and 5500 bp downstream of the seed sequence, respectively, were required to reach the 5′-end of the rRNA operon.</p>
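<p>The coordinate arithmetic in the preceding paragraph can be illustrated with a short Python sketch. It is an interpretation under stated assumptions, not the authors' procedure: coordinates are taken to be 1-based and inclusive, ‘upstream’ and ‘downstream’ are measured relative to the strand of the seed sequence, and the function name <monospace>operon_window</monospace> and its parameters are hypothetical.</p>
<preformat>
```python
# Sketch only: 1-based inclusive coordinates, with "upstream" and
# "downstream" defined relative to the strand of the seed sequence.
# The 500 bp and 5000 bp defaults are the values given in the text.


def operon_window(seed_start, seed_end, strand, contig_length,
                  upstream=500, downstream=5000):
    """Return (left, right, truncated) for the region to extract.

    seed_start and seed_end are the lower and higher contig
    coordinates of the seed sequence; strand is "+" or "-".
    truncated is True when the contig ends before the full window.
    """
    if strand == "+":
        raw_left = seed_start - upstream
        raw_right = seed_end + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        raw_left = seed_start - downstream
        raw_right = seed_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    left = max(1, raw_left)
    right = min(contig_length, raw_right)
    truncated = (left, right) != (raw_left, raw_right)
    return left, right, truncated
```
</preformat>
<p>The <monospace>truncated</monospace> flag corresponds to the case in which a contig ends prematurely; the larger downstream distances mentioned for two seed sequences could be passed through the <monospace>downstream</monospace> argument.</p>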
<p>The GenBank protein identifiers of the 10 seed sequences for PF10695 (GI 81390223, 122460149, 30316295, 122409581, 122668550, 149912438, 150010506, 121729919, 154505437, 154487654) were obtained from the Web site for the NCBI Conserved Domain (pfam10695) at
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=151191">http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=151191</ext-link>
. The start and end coordinates for the seed sequences were found in two different ways, depending on the sequence. For SwissProt seed sequences, the Exon Information area at the bottom of the ExonView screen gave the strand (plus or minus), and the start and end coordinates for the seed sequence. For GenBank proteins, the ‘/coded_by’ entry in the CDS feature for the record gave the same information. The nucleotide accessions and coordinates for the gene contexts surrounding the PF10695 seed sequences can be found in an Excel file in
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
.</p>
<p>Using the nucleotide accession information for the seed sequences, we then calculated expected 5′- and 3′-ends of the rRNA operon containing the seed sequences, as described above, and extracted the context from GenBank in GenBank format. We used Artemis (
<xref ref-type="bibr" rid="B26">26</xref>
) to reannotate the rRNA sequences using either the RNAmmer 1.2 Server (
<ext-link ext-link-type="uri" xlink:href="http://www.cbs.dtu.dk/services/RNAmmer/">http://www.cbs.dtu.dk/services/RNAmmer/</ext-link>
) for complete rRNA operons, or blastn against GenBank's nucleotide database for incomplete rRNA operons. The reannotated Artemis screens were then exported to a graphics file and traced to scale in PowerPoint. The result is shown in
<xref ref-type="fig" rid="F1">Figure 1</xref>
.
<fig id="F1" position="float">
<label>Figure 1.</label>
<caption>
<p>Gene contexts of
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 eIGR1 and eIGR11. The dotted lines indicate that the rRNA genes were not annotated originally, but were found in this study using the RNAmmer web site. The broken lines on the scale bar show that the 5S rRNA gene in
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 is found in another part of the genome from the adjacent 16S and 23S rRNA genes.</p>
</caption>
<graphic xlink:href="gkr576f1"></graphic>
</fig>
</p>
</sec>
<sec>
<title>Visualization of historical rRNA annotations</title>
<p>We used the Web Site for the GOLD Database,
<ext-link ext-link-type="uri" xlink:href="http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi">http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi</ext-link>
, to obtain a list of complete genomes sorted in sequential order of creation. Starting with GOLD identifier ‘Gc00001’, we navigated to the summary page for the genome in the Integrated Microbial Genomes (IMG) database (
<xref ref-type="bibr" rid="B27">27</xref>
). We added all rRNA genes to the Gene Cart and used the ‘Show Neighborhood’ feature to visualize the gene contexts of all rRNA genes in the genome.</p>
</sec>
<sec>
<title>Determination of original misannotation of ‘cell wall hydrolase’ in PF10695</title>
<p>We searched for PF10695 on the UniProtKB/Swiss-Prot web site (
<ext-link ext-link-type="uri" xlink:href="http://www.uniprot.org/">http://www.uniprot.org/</ext-link>
), and sorted all accessions for PF10695 by ‘Date of creation’. The first record displayed (Q8CME1, created 2003-03-01) had a protein name of ‘Cell wall-associated hydrolase’. The ‘gene names’ column for this accession listed nine gene loci (VV1_0473, VV1_0917, VV1_0925, VV1_0970, VV1_1072, VV1_1190, VV1_1418, VV1_1502, VV2_1450), all for the organism
<italic>Vibrio vulnificus</italic>
. Using the link for accession Q8CME1, we navigated to each GenBank protein accession for these loci (AAO08321.1, AAO08995.1, AAO09418.1, AAO09424.1, AAO09463.1, AAO09551.1, AAO09653.1, AAO09859.1, AAO09933.1). GenBank reported that all records had been removed, but the obsolete versions were accessible. Each obsolete accession had a note saying ‘similar to invasion-associated proteins; COG0791’. In order to determine when the records were deleted, we clicked on the ‘Revision History’ radio button under ‘Display Settings’ for the protein record display. It showed that the records were removed on 4 January 2006, having been first seen on 22 December 2002.</p>
<p>We determined that the similarity to COG0791 reported in the obsolete protein records was an error. To do this, we searched the Clusters of Orthologous Groups (COG) database for the amino acid sequences using NCBI's Conserved Domain web site (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi">http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi</ext-link>
) and found no match to any COG using the default cut-off of
<italic>E</italic>
 = 0.01. Loosening the cut-off to
<italic>E</italic>
 = 100, we found poor matches (
<italic>E</italic>
 > 0.84) to four COGs, none of which were COG0791.</p>
<p>To confirm that the genome contexts of all the obsolete proteins were within the rRNA operons of
<italic>V. vulnificus</italic>
, we obtained their nucleotide accessions and coordinates from the ‘coded_by’ feature of their CDS. We downloaded the FASTA nucleotide records and verified that the nucleotide sequences for all loci were identical. We then performed a blastn search of each nucleotide sequence against the
<italic>V. vulnificus</italic>
CMCP6 genome using the NCBI Web site, and they all returned one match to rRNA-23S ribosomal RNA and no other genome feature. This confirmed that all nine obsolete loci were annotated as embedded, overlapping Open Reading Frames (ORFs) with an rRNA 23S sequence before they were deleted from GenBank.</p>
</sec>
<sec>
<title>Analysis of spurious ORFs in
<italic>E. coli</italic>
rRNA sequences</title>
<p>The EMBOSS getorf utility was used to generate spurious ORFs in the
<italic>rrsH</italic>
and
<italic>rrlH</italic>
genes of
<italic>E. coli</italic>
, using a permissive parameter of 100 nt from any methionine codon to any stop codon. The translated protein sequences from the spurious ORFs were compared to a copy of the NCBI non-redundant (nr) protein database (January 2011) with a cut-off of
<italic>E</italic>
 = 0.001. Custom Perl scripts for parsing the blast output and for retrieving GenBank data were used to identify and fetch the nucleotide sequences for the protein matches to the spurious ORFs. These nucleotide sequences were compared to a copy of the SILVA (
<xref ref-type="bibr" rid="B28">28</xref>
) rRNA database (June 2010) using blastn. The blastn output was parsed with a custom Perl script. If the nucleotide sequence for a protein matched a SILVA rRNA sequence at 90% nucleotide identity over 90% of its length, it was considered a misannotated protein. The misannotated proteins were mapped back to their corresponding spurious ORF from
<italic>E. coli</italic>
rRNA in an Excel spreadsheet, which is included in
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
. The results were visualized as shown in
<xref ref-type="fig" rid="F4">Figure 4</xref>
.</p>
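<p>The kind of permissive ORF scan described above (any ATG to any in-frame stop codon, at least 100 nt, all six reading frames) can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions and is not the EMBOSS getorf program; in particular, each ORF here begins at the first ATG after the preceding stop codon, and the function name <monospace>find_orfs</monospace> is hypothetical.</p>
<preformat>
```python
# Sketch only: a simple six-frame ATG-to-stop scan, not EMBOSS getorf.
# The 100 nt minimum is the parameter stated in the text.

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_nt=100):
    """Return (strand, frame, start, end) tuples for candidate ORFs.

    start and end are 0-based, end-exclusive positions on the scanned
    strand (for "-" they refer to the reverse-complemented sequence),
    and end includes the stop codon. The coding part, from ATG up to
    but not including the stop codon, must be at least min_nt long.
    """
    seq = seq.upper().replace("U", "T")
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if i - start >= min_nt:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```
</preformat>
<p>Each reported ORF could then be translated and compared with a protein database, as described in the text.</p>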
</sec>
<sec>
<title>Generation and analysis of pseudoreads</title>
<p>The nucleotide sequences of the rRNA subunits from
<italic>E. coli</italic>
str. K-12, substr. MG1655
<italic>, Sulfolobus acidocaldarius</italic>
DSM 639
<italic>,</italic>
and
<italic>S. cerevisiae</italic>
S288c were retrieved from GenBank in fasta format. A custom Perl script then removed the fasta headers from this file and concatenated all of the sequence data for all of the rRNAs into one long string. A second Perl script generated 10 000 pseudoreads from this long string by choosing a starting point at random, then pulling a randomly chosen number of base pairs from a file containing the read lengths of an actual pyrosequencing run. The pseudoreads thus generated were written to a fasta file that was queried against the 28 April 2010 copy of NCBI's nr database, using blastx (version 2.2.21) and a cutoff of
<italic>E</italic>
 < 0.001, with 25 summaries and alignments retained. The blastx text output was read into MEGAN (
<xref ref-type="bibr" rid="B29">29</xref>
) version 3.9. The phylogeny and function of the proteins matching the pseudoreads were visualized in MEGAN using default parameters.</p>
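<p>The pseudoread procedure can be sketched in Python as follows. This is a minimal illustration and not the authors' Perl scripts. It assumes the rRNA sequences are supplied as fasta text and the empirical read lengths as a list of integers. Unlike the description above, the sketch restricts the random start position so that every read fits within the concatenated string; that restriction, the function names and the <monospace>seed</monospace> parameter are assumptions made here.</p>
<preformat>
```python
# Sketch only: draws pseudoreads from concatenated rRNA sequence using
# an empirical list of read lengths. Names and the seed parameter are
# illustrative assumptions.
import random


def concatenate_fasta(fasta_text):
    """Drop fasta header lines and join all sequence lines into one string."""
    return "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )


def make_pseudoreads(sequence, read_lengths, n_reads=10000, seed=None):
    """Return n_reads substrings with lengths drawn from read_lengths."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Cap the length so a read is never longer than the sequence.
        length = min(rng.choice(read_lengths), len(sequence))
        # Choose a start such that the whole read fits in the string.
        start = rng.randrange(len(sequence) - length + 1)
        reads.append(sequence[start:start + length])
    return reads
```
</preformat>
<p>Because the sequences are concatenated before sampling, some pseudoreads span the junction between two rRNA sequences, as would also be the case for the procedure described in the text.</p>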
</sec>
</sec>
<sec>
<title>RESULTS AND DISCUSSION</title>
<sec>
<title>Search for eIGRs</title>
<p>As described in the ‘Materials and Methods’ section, we searched for known eIGRs from
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 in our metatranscriptome, in order to compare our results with the study that discovered eIGRs (
<xref ref-type="bibr" rid="B24">24</xref>
). The most commonly occurring
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 eIGR in our dataset was eIGR1 (
<xref ref-type="table" rid="T1">Table 1</xref>
). A blastn comparison of the eIGR1 sequence (GI 254455249:44547–44776, 230 bp long) against the GenBank nucleotide database revealed that positions 77–230 of eIGR1 matched with 99% identity to the single 23S rRNA gene of
<italic>Candidatus Pelagibacter</italic>
ubique HTCC1062. Positions 118–187 of eIGR1 matched with 83% identity to all seven 23S rRNA genes of
<italic>E. coli</italic>
K-12
<italic>,</italic>
confirming that the
<italic>Candidatus Pelagibacter</italic>
ubique HTCC1062 23S annotation was reasonably accurate and that at least half of eIGR1 contained unannotated 23S rRNA sequence
<italic>.</italic>
When we examined the larger context of the eIGR1 region of the
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 genome in the NCBI genome browser, we saw that it was flanked by a ‘cell wall hydrolase’ (Pfam PF10695) on one side. The ‘cell wall hydrolase’ was in turn flanked by a long stretch of unannotated sequence. We extracted the nucleotides of the entire unannotated region of
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 near eIGR1 (NZ_DS995298.1:43539-50898, 7359 bp) and submitted the region to the RNAmmer 1.2 WebServer (
<xref ref-type="bibr" rid="B30">30</xref>
). The predicted 5′-end of the 23S extended well into the eIGR1 region (
<xref ref-type="fig" rid="F1">Figure 1</xref>
). This meant that gene locus PB7211_763 (‘cell wall hydrolase’) of
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 overlapped a 23S sequence on the antisense strand, something that has always been considered an annotation error in Bacteria and Archaea. However, the protein sequence of PB7211_763 returned a strong (
<italic>E</italic>
 = 1.24e-44) match to pfam10695, Cw-hydrolase, using the search function of the NCBI Conserved Domain web site. We found this result surprising and worth investigating further.
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<p>Rank order listing of
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 eIGRs found in marine metatranscriptomes</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">eIGR</th>
<th rowspan="1" colspan="1">Count</th>
<th rowspan="1" colspan="1">Comment</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">22 103</td>
<td rowspan="1" colspan="1">Unannotated 23S rRNA (this study)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">3257</td>
<td rowspan="1" colspan="1">Unannotated RNase P [Shi
<italic>et al</italic>
. (
<xref ref-type="bibr" rid="B24">24</xref>
)]</td>
</tr>
<tr>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">1959</td>
<td rowspan="1" colspan="1">Unannotated putative tmRNA [Shi
<italic>et al</italic>
. (
<xref ref-type="bibr" rid="B24">24</xref>
)]</td>
</tr>
<tr>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">243</td>
<td rowspan="1" colspan="1">Unannotated 5S rRNA (this study)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">73</td>
<td rowspan="1" colspan="1">Glycine-activated riboswitch</td>
</tr>
<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">61</td>
<td rowspan="1" colspan="1">Glycine-activated riboswitch</td>
</tr>
<tr>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">36</td>
<td rowspan="1" colspan="1">Unknown</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">Unknown</td>
</tr>
<tr>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">Unknown</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">Unknown</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">Unknown</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TF1">
<p>The numbering of the eIGRs is taken from Shi
<italic>et al</italic>
. (
<xref ref-type="bibr" rid="B24">24</xref>
). The comment column describes the content of the intergenic region, if known.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec>
<title>Analysis of seed sequences for PF10695</title>
<p>In order to determine how well PF10695 was characterized, we obtained all 10 of its seed sequences from the NCBI Conserved Domain web site. We then examined the gene contexts for each seed sequence as described in the ‘Materials and Methods’ section and found that all 10 seed sequences were in fact embedded and overlapping ORFs within rRNA operons (
<xref ref-type="fig" rid="F2">Figure 2</xref>
), just as PB7211_763 of
<italic>Candidatus Pelagibacter</italic>
sp. HTCC7211 was. Clearly, PF10695 had been created in error from misannotations. We asked how this had come about.
<fig id="F2" position="float">
<label>Figure 2.</label>
<caption>
<p>Gene context for seed sequences of Pfam 10695. The seed sequences are shown in green. The 16S, 23S and 5S sequences are shown in red, blue and white, respectively, with dotted outlines for those sequences that were not annotated by the sequencing center shown, but were found in this study using the RNAmmer 1.2 Web server. Other embedded ORFs within the rRNA operon are shown in brown.</p>
</caption>
<graphic xlink:href="gkr576f2"></graphic>
</fig>
</p>
<p>The ‘Materials and Methods’ section describes how we determined the first misannotation of an embedded and overlapping ORF annotated ‘cell wall hydrolase’ within an rRNA operon. The first misannotation was made in nine copies of rRNA operons in
<italic>V. vulnificus</italic>
CMCP6
<italic>.</italic>
The misannotation was eventually corrected by deleting all nine protein sequences from GenBank; however, they were active for ∼3 years in GenBank before being deleted and are still active in SwissProt. The ‘Materials and Methods’ section describes exactly how we found the original misannotation of ‘cell wall hydrolase’ using the SwissProt (UniProtKB) database. During the course of that investigation, we noted that the only ‘reviewed’ record for PF10695 was locus TC_0114 of
<italic>Chlamydia muridarum</italic>
, one of the seed sequences for PF10695. Since SwissProt curators had reviewed this locus, we examined Revision Histories in GenBank and SwissProt (UniProtKB) in detail to see what basis they might have found for this being a valid protein.</p>
<p>It appears that at some point, a SwissProt curator might have thought that there was some experimental evidence for TC_0114, even though none of the 41 versions of it in SwissProt (accession Q9PLI5) contain such a notation. The indication of experimental evidence comes from GenBank's record of the Swiss-Prot protein sequence for TC_0114 (protein accession Q9PLI5, GI 30316295). It currently shows a feature of ‘/experiment = “experimental evidence, no additional details recorded”’ added 13 April 2006. However, GenBank's protein accession for TC_0114 (protein accession AAF38993, GI 29251569) has a note saying ‘identified by Glimmer2; putative’ and makes no mention of experimental evidence. We could find no literature supporting experimental evidence for TC_0114 and conclude that it in fact was a spurious prediction of Glimmer2 and was incorrectly reviewed by SwissProt.
<fig id="F3" position="float">
<label>Figure 3.</label>
<caption>
<p>Comparison of annotations of
<italic>P. horikoshii</italic>
OT3. This figure compares the annotations of the 16S–23S rRNA operon of
<italic>P. horikoshii</italic>
OT3, the 14th genome annotated. The 5S rRNA sequence is located elsewhere in the genome. Coloring and abbreviations are the same as
<xref ref-type="fig" rid="F1">Figure 1</xref>
.</p>
</caption>
<graphic xlink:href="gkr576f3"></graphic>
</fig>
</p>
<p>A likely origin of protein family PF10695 was now discernible. From late 2002 to early 2006, the spurious ORFs in the unannotated 23S rRNA operon of
<italic>V. vulnificus</italic>
were stored in GenBank with a product of ‘cell wall hydrolase’. They were very similar (74% amino acid identity) to the incorrectly reviewed SwissProt entry for TC_0114 of
<italic>C. muridarum</italic>
. Annotators or pipelines using Glimmer2 for gene finding would have found a SwissProt ‘reviewed’ protein and a protein annotated ‘cell wall hydrolase’ embedded and overlapping 23S rRNA sequences. On the basis of this evidence, some annotators or pipelines evidently called their spurious ORFs ‘cell wall hydrolase’, while others called them ‘conserved hypothetical’. Others saw weak or erroneous matches to other proteins and annotated them from those matches. Thus, a variety of annotations arose for spurious embedded, overlapping proteins within rRNA operons, the most common of which was ‘cell wall hydrolase’. When these annotations accumulated to sufficient levels, they apparently prompted Pfam to create the protein family PF10695, ‘Cw-hydrolase’. As of this study, PF10695 contains 1780 NCBI proteins and 1653 metagenomic fragments.</p>
<p>The staff at Pfam reviewed this article, concur that PF10695 is spurious and have marked it for deletion in release 26.0 (A. Bateman, personal communication). They informed us that four other families were deleted in the past for the same reason (PF07612, PF07616, PF07630 and PF07633) and another (PF05330) was deleted because it contained spurious human genes based on a repeat.</p>
</sec>
<sec>
<title>Additional misannotations of rRNA in GenBank</title>
<p>The additional misannotations of embedded, overlapping proteins within rRNA operons shown in brown in
<xref ref-type="fig" rid="F2">Figure 2</xref>
indicated that the misannotation of rRNA operons was not confined to pfam10695. Therefore, we inquired into their origin as well. To do this, we obtained a date-sorted list of all microbial genomes in the Genomes OnLine Database (GOLD) (
<xref ref-type="bibr" rid="B31">31</xref>
), and visualized the gene contexts of their rRNA operons, starting with
<italic>Haemophilus influenzae,</italic>
as described in the ‘Materials and Methods’ section. The first annotated rRNA operon with protein sequences overlapping with and embedded in an rRNA operon appeared in the 1998 genome annotation for
<italic>Pyrococcus horikoshii</italic>
OT3 (
<xref ref-type="bibr" rid="B32">32</xref>
), the 14th complete genome sequenced. The authors commented in their original submission to GenBank (BA000001.2) that, ‘All the sequence with length 100 codons or more between ATG and GTG and stop codon are defined as CDS’. They also said that ORFs of 50–99 codons were considered probable protein-coding regions if they showed some similarity to proteins in public databases. Summarizing their approach, the authors explained, ‘It should be noted that the ORFs mentioned above merely represent the protein-coding potentiality under the defined assumptions’. Nothing was said about eliminating overlapping ORFs; apparently these were retained either deliberately or inadvertently.</p>
<p>There are two potential reasons why the authors might have taken a CDS-finding approach so prone to false positives. First, their study organism was an Archaeon, the least studied domain of life, and they might have preferred to call false positives rather than to miss a novel Archaeon protein. Second, they may have noted reports (
<xref ref-type="bibr" rid="B33 B34 B35">33–35</xref>
) that genes arising from horizontal gene transfer are sometimes missed by CDS-finding algorithms that rely on codon frequencies (
<xref ref-type="bibr" rid="B36">36</xref>
) or Markov chain models (
<xref ref-type="bibr" rid="B37">37</xref>
) of ‘typical’ genes in the genome. Whatever their reasoning, these authors provided a genome with multiple protein coding domains overlapping rRNA genes in rRNA operons (
<xref ref-type="fig" rid="F3">Figure 3</xref>
). At the same time, these authors erroneously called the end of the 23S rRNA subunit, an error that was discovered and corrected by RefSeq curators
<italic>.</italic>
Although they corrected the length of the 23S rRNA subunit, the RefSeq curation staff did not remove all of the overlapping ORFs inside the rRNA operon. Interestingly, they lengthened one of the overlapping ORFs so that it overlapped with another one, creating a triple overlap (
<xref ref-type="fig" rid="F3">Figure 3</xref>
). As a result, neither the nr database nor the RefSeq database at NCBI contain what we presume to be the correct annotation (
<xref ref-type="fig" rid="F3">Figure 3</xref>
).</p>
<p>We were able to demonstrate that at least 367 additional genomes in NCBI's nr database have misannotated proteins (
<xref ref-type="fig" rid="F4">Figure 4</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
). To demonstrate this, we intentionally generated a large number of spurious ORFs in
<italic>E. coli</italic>
16S and 23S rRNA sequences (
<xref ref-type="fig" rid="F4">Figure 4</xref>
, top left and bottom) and counted the close matching nr proteins whose nucleotide sequences also had a very strong match to rRNA sequences in the SILVA database (
<xref ref-type="fig" rid="F4">Figure 4</xref>
, upper right). The majority of the spurious ORFs had at least one nr protein hit whose nucleotide sequence also matched a SILVA rRNA sequence at >90% nucleotide identity over 90% of its length. Some spurious ORFs had well over 100 such matches to misannotated proteins. When the accessions and associated organism names for all misannotated proteins were combined, it emerged that genome sequences for 367 organisms in nr contained misannotated proteins (
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
).
<fig id="F4" position="float">
<label>Figure 4.</label>
<caption>
<p>NCBI nr Hits to Spurious ORFs in
<italic>E. coli</italic>
rRNA. Top left, spurious ORFs in
<italic>E. coli</italic>
16S rRNA. Bottom, spurious ORFs in
<italic>E. coli</italic>
23S rRNA. Scales are in base pairs. White arrows, three reading frames on positive strand, gray arrows, three reading frames on negative strand. Inset at upper right shows the log
<sub>10</sub>
of the number of NCBI nr protein hits to the translated amino acids for each spurious ORF in both of the
<italic>E. coli</italic>
rRNA sequences. The NCBI nr protein hit was only counted if its nucleotide sequence matched a known rRNA sequence in the SILVA database. The detail for each hit is provided in
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
.</p>
</caption>
<graphic xlink:href="gkr576f4"></graphic>
</fig>
</p>
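<p>As a rough illustration of how spurious ORFs arise in a non-coding sequence, the sketch below scans all six reading frames of a DNA string for stop-free stretches of codons. The function names, the default 50-codon minimum and the permissive definition of an ORF (no start codon required) are assumptions made for illustration; this is not the procedure used to produce Figure 4.</p>
<preformat>
```python
# Illustrative sketch only: names and the min_codons default are assumptions.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq, min_codons=50):
    """Yield (strand, frame, start, end) for stop-free runs of codons.

    Coordinates are 0-based, end-exclusive offsets on the strand being
    read. A run is reported when it holds at least min_codons codons.
    No start codon is required, so this is a permissive ORF definition.
    """
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while len(dna) - pos >= 3:
                if dna[pos:pos + 3] in STOPS:
                    # A stop codon closes the current run.
                    if (pos - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, pos
                    run_start = pos + 3
                pos += 3
            # Report a run that reaches the end of the sequence.
            if (pos - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, pos
```
</preformat>
<p>Run on a 16S or 23S rRNA gene, a scan of this kind returns candidate ORFs on both strands even though the sequence codes for no protein, which is why an rRNA check is needed before such candidates are retained.</p>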
</sec>
<sec>
<title>Misannotations of rRNA in the SEED</title>
<p>Having found this instance of overlapping ORFs embedded within rRNA operons in only the 14th genome sequenced, we continued looking for them in subsequent genomes. The 27th genome sequenced,
<italic>Chlamydophila pneumoniae</italic>
AR39 (GOLD identifier Gc00027, completed 15 March 2000), contained an annotation for a putative hypothetical protein at gene locus CP0987, overlapping a 23S rRNA sequence, thus indicating that it coded for protein and rRNA at the same time, exactly as was the case for PF10695 seed sequences. The original GenPept accession for CP0987 (AAF38766.1) has been removed as obsolete but the accession created for it in RefSeq (NP_445524.1) is still active. More significantly, the Gene Detail for CP0987 in IMG (Object 637042263) showed this SEED (
<xref ref-type="bibr" rid="B38">38</xref>
) identifier for CP0987: ‘Retron-type reverse transcriptase, fig|115711.7.peg.950’. As noted in the ‘Introduction’ section, although inserts of protein sequences are known in Bacteria and Archaea, simultaneous coding of protein and rRNA by the same sequence is not. This indicated that the SEED database also contained errors.</p>
<p>In order to find misannotations in The SEED Viewer, we did an identifier search for fig|115711.7.peg.950 and asked for 64 comparable regions on the Annotation Overview page. The system returned 30 gene contexts, all showing a gene of similar length to CP0987, all completely overlapping a 23S rRNA gene in exactly the same manner as the seed sequences for PF10695. All 30 genes were annotated ‘Retron-type reverse transcriptase’. The Web page said that these features were part of a subsystem called ‘Group II intron-associated genes’; however, the subsystem had not been classified for the organism. We clicked on the link to ‘Group II intron-associated genes’, but there was no diagram of a subsystem and no literature listed in the ‘Functional Roles’ tab. We visually inspected the 30 gene contexts and found additional embedded and overlapping ORFs in the 23S rRNA sequences displayed. Of the 30 contexts, 12 also showed a conserved hypothetical protein that completely overlapped the 23S rRNA, just downstream of the misannotated ‘Retron-type reverse transcriptase’. Three of the 30 contexts showed a different conserved hypothetical protein that also completely overlapped the 23S rRNA sequence, and three more contexts showed from one to three very small (fewer than 53 amino acids) conserved hypothetical proteins completely overlapping a 23S rRNA sequence. By ‘completely overlapping’ we mean ‘coding for rRNA and protein at the same time’.</p>
<p>This confirmed that there are errors in the SEED, despite the care taken by the current Rapid Annotations using Subsystems Technology (RAST) pipeline not to add any more errors. The RAST pipeline begins by calling tRNA and rRNA genes and ‘the server will not consider retaining any protein-encoding genes that are embedded in rRNAs. These gene calls are almost certainly artefacts of the period in which groups were learning how to develop proper annotations, and RAST attempts to avoid propagating these errors’. Still, we assert that existing misannotations should be found and corrected.</p>
</sec>
<sec>
<title>Eukaryotic pseudo rRNA genes and antisense transcripts to rRNA genes</title>
<p>Having fully addressed the overrepresented eIGRs in our dataset, we next looked for overrepresented protein sequences. We found them in some Eukaryotic sequences (
<xref ref-type="table" rid="T2">Table 2</xref>
). Again, the reason for the overrepresentation was rooted in sequence similarity between the protein sequences and known rRNA sequences. However, we discovered that homology between ‘senescence associated proteins’ and rRNA sequences has in fact been reported in studies of Eukaryotes (
<xref ref-type="bibr" rid="B22">22</xref>
), along with the other examples of protein–rRNA homology noted in the ‘Introduction’ section (
<xref ref-type="bibr" rid="B18 B19 B20 B21">18–21</xref>
). We could not determine whether the protein in
<italic>Chlamydomonas</italic>
with similarity to 18S rRNA was a misannotation or a confirmed protein with homology to 18S rRNA, owing to the difficulties in annotating Eukaryotic rRNA sequences that have been discussed in the literature (
<xref ref-type="bibr" rid="B30">30</xref>
). However, it was clear that at least some real Eukaryotic proteins with rRNA homology do in fact exist, and therefore have the potential to generate false positives in metatranscriptomic studies.
<table-wrap id="T2" position="float">
<label>Table 2.</label>
<caption>
<p>Amino acid and nucleotide comparison of highly represented putative mRNAs</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Putative mRNAs</th>
<th rowspan="1" colspan="1">Organism</th>
<th rowspan="1" colspan="1">KEGG annotation</th>
<th rowspan="1" colspan="1">Base pairs/AA</th>
<th rowspan="1" colspan="1">Prog</th>
<th rowspan="1" colspan="1">ID (%)</th>
<th rowspan="1" colspan="1">Len (%)</th>
<th rowspan="1" colspan="1">NCBI nt/nr best specific hit</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="2" colspan="1">42 003</td>
<td rowspan="2" colspan="1">
<italic>Phaeodactylum</italic>
</td>
<td rowspan="2" colspan="1">Hypothetical protein (pti:PHATRDRAFT_37403)</td>
<td rowspan="1" colspan="1">324</td>
<td rowspan="1" colspan="1">blastn</td>
<td rowspan="1" colspan="1">99</td>
<td rowspan="1" colspan="1">96</td>
<td rowspan="1" colspan="1">Uncultured organism 28S rRNA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">107</td>
<td rowspan="1" colspan="1">blastp</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">84</td>
<td rowspan="1" colspan="1">Senescence-associated protein</td>
</tr>
<tr>
<td rowspan="2" colspan="1">19 678</td>
<td rowspan="2" colspan="1">
<italic>Ostreococcus</italic>
</td>
<td rowspan="2" colspan="1">Predicted protein (olu:OSTLU_9775)</td>
<td rowspan="1" colspan="1">264</td>
<td rowspan="1" colspan="1">blastn</td>
<td rowspan="1" colspan="1">99</td>
<td rowspan="1" colspan="1">99</td>
<td rowspan="1" colspan="1">Uncultured organism 28S rRNA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">88</td>
<td rowspan="1" colspan="1">blastp</td>
<td rowspan="1" colspan="1">90</td>
<td rowspan="1" colspan="1">102</td>
<td rowspan="1" colspan="1">Senescence-associated protein</td>
</tr>
<tr>
<td rowspan="2" colspan="1">13 880</td>
<td rowspan="2" colspan="1">
<italic>Chlamydomonas</italic>
</td>
<td rowspan="2" colspan="1">Hypothetical protein (cre:CHLREDRAFT_155068)</td>
<td rowspan="1" colspan="1">264</td>
<td rowspan="1" colspan="1">blastn</td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">88</td>
<td rowspan="1" colspan="1">Uncultured organism 28S rRNA</td>
</tr>
<tr>
<td rowspan="1" colspan="1">87</td>
<td rowspan="1" colspan="1">blastp</td>
<td rowspan="1" colspan="1">26</td>
<td rowspan="1" colspan="1">30</td>
<td rowspan="1" colspan="1">Unknown protein (
<italic>Glycine max</italic>
)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TF2">
<p>The blastn and blastp matches are shown for the three most abundant putative mRNA transcripts, which together accounted for (75 561/904 042) = 8.36% of total putative mRNA. AA, amino acids; Prog, program; ID, identity of best match; Len, length of alignment to best match.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec>
<title>MEGAN analysis of false positive protein matches to pseudoreads of rRNA</title>
<p>In order to gauge the full scope of potential false positives due to misannotations of rRNA operons in Bacteria and Archaea and to Eukaryotic proteins with rRNA homology, we used MEGAN software to perform phylogenetic and functional analysis of spurious protein matches to ‘pseudoreads’ of rRNA (
<xref ref-type="fig" rid="F5">Figure 5</xref>
). Nearly 90% of the pseudoreads had hits to which phylogeny could be assigned. Despite the fact that the pseudoreads of rRNA came from only three model organisms (
<italic>E. coli, S. acidocaldarius</italic>
and
<italic>S. cerevisiae</italic>
), the spurious phylogenetic analysis included Bacteroidetes, Firmicutes, Alphaproteobacteria and a striking diversity of Eukaryotic taxa, including pine tree and chicken. Despite the fact that all the pseudoreads were rRNA sequences with no protein function, functional analysis in MEGAN was possible on (1807/10 000) = 18% of the reads (
<xref ref-type="table" rid="T3">Table 3</xref>
). Not surprisingly, the most frequent functional category was ‘cell wall hydrolase’.
<fig id="F5" position="float">
<label>Figure 5.</label>
<caption>
<p>MEGAN phylogenetic analysis of pseudoreads. The phylogeny of proteins in nr matching 10 000 translated pseudoreads of rRNA taken from
<italic>E. coli, Sulfolobus,</italic>
and
<italic>S. cerevisiae</italic>
is shown, with the number of reads classified to the taxon shown.</p>
</caption>
<graphic xlink:href="gkr576f5"></graphic>
</fig>
<table-wrap id="T3" position="float">
<label>Table 3.</label>
<caption>
<p>MEGAN functional analysis of pseudoreads</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">GO term</th>
<th rowspan="1" colspan="1">Number of reads</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Hydrolase activity</td>
<td rowspan="1" colspan="1">563</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Mitochondrion</td>
<td rowspan="1" colspan="1">317</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Regulation of cellular respiration</td>
<td rowspan="1" colspan="1">250</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Phosphatidylcholine phospholipase C activity</td>
<td rowspan="1" colspan="1">154</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Integral to membrane</td>
<td rowspan="1" colspan="1">139</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Metabolic process</td>
<td rowspan="1" colspan="1">126</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Chloroplast</td>
<td rowspan="1" colspan="1">117</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Catalytic activity</td>
<td rowspan="1" colspan="1">108</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Endonuclease activity</td>
<td rowspan="1" colspan="1">31</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Transferase activity</td>
<td rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Total</td>
<td rowspan="1" colspan="1">1807</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TF3">
<p>The functions of proteins in nr matching 10 000 translated pseudoreads of rRNA taken from
<italic>E. coli, S. acidocaldarius</italic>
and
<italic>S. cerevisiae</italic>
are shown, with the number of reads classified to the Gene Ontology (GO) term shown.
</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec>
<title>Analysis of standard operating procedures of major sequencing centers</title>
<p>The Standard Operating Procedures (SOPs) of the four major sequencing centers participating in the Human Microbiome Project (available at
<ext-link ext-link-type="uri" xlink:href="http://hmpdacc.org/tools_protocols/tools_protocols.php">http://hmpdacc.org/tools_protocols/tools_protocols.php</ext-link>
;
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
) show a complete reliance on Rfam and RNAmmer to find rRNA genes. None have adapted Niels Larsen's ‘search_for_rnas’ (available from the author) to use blastn searches of known rRNAs. In a recent comparison (
<xref ref-type="bibr" rid="B39">39</xref>
), it was found that two pipelines that have adapted ‘search_for_rnas’, the Integrated Microbial Genomes Expert Review (IMG_ER) (
<xref ref-type="bibr" rid="B40">40</xref>
) pipeline of the Joint Genome Institute (JGI;
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
) and the RAST pipeline (
<xref ref-type="bibr" rid="B17">17</xref>
), correctly found all three rRNA subunits of
<italic>Halorhabdus utahensis.</italic>
The J. Craig Venter Institute (JCVI) pipeline, which relies on RNAmmer (
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
), found only one-third of the 16S sequence and no 23S sequence at all. Only the 5S was found correctly by JCVI. Also, RNAmmer does not even attempt to find Eukaryotic 5.8S and performs poorly on Archaeal 5S and Eukaryotic 18S sequences (
<xref ref-type="bibr" rid="B30">30</xref>
). Additionally, we found in this study that while RNAmmer works well for complete sequences, it does not work for incomplete sequences, such as those found at the ends of contigs in draft sequences. True, RNAmmer is rapid and consistent, as the authors claim, but in our opinion it is not accurate enough to be used completely on its own for all draft and completed genome sequences. Therefore, it is our opinion that all of the SOPs of sequencing centers associated with the Human Microbiome Project would be improved by adapting ‘search_for_rnas’, as has been done with the IMG-ER and RAST pipelines, in addition to using RNAmmer for initial rRNA finding.</p>
<p>An additional improvement to all SOPs associated with the Human Microbiome Project is an explicit manual or automated check to see if in fact any rRNA operons were found at all prior to doing any gene finding and elimination of overlaps. The check should also consider whether the size and organization of the putative rRNA operons depart from the known size and organization for the sequenced organism's domain of life. An automated check could be fashioned by querying the nucleotide sequences of all putative proteins against a known database of rRNA with blastn. A match indicates that the putative protein is likely to be a spurious ORF within an rRNA operon.</p>
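<p>The automated check described above can be sketched as a filter over tabular blastn output. The sketch assumes the search of putative protein-coding nucleotide sequences against an rRNA database was written in a five-column tabular layout (query ID, subject ID, percent identity, alignment length, query length); the function name and the 90% thresholds, borrowed from the criterion used for Figure 4, are illustrative assumptions rather than part of any SOP.</p>
<preformat>
```python
# Illustrative sketch only. Assumed input: tab-separated blastn hits with
# the columns qseqid, sseqid, pident, length, qlen (for example, BLAST+
# run with -outfmt "6 qseqid sseqid pident length qlen").


def flag_rrna_like(tabular_lines, min_identity=90.0, min_coverage=0.90):
    """Return {query_id: first_matching_rrna_id} for suspect CDS features.

    A putative protein-coding sequence is flagged when a hit to a known
    rRNA reaches min_identity percent identity and the alignment spans at
    least min_coverage of the query length. Alignment length includes
    gaps, so the coverage value is an approximation.
    """
    flagged = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, length, qlen = line.split("\t")[:5]
        coverage = int(length) / int(qlen)
        if float(pident) >= min_identity and coverage >= min_coverage:
            # Keep only the first qualifying hit per query.
            flagged.setdefault(qseqid, sseqid)
    return flagged
```
</preformat>
<p>Any query returned by such a filter is a candidate spurious ORF within an rRNA operon and would warrant removal or manual review before submission.</p>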
<p>Gene finding should only proceed after it has been verified that all rRNAs have in fact been found and annotated with accurate starts and ends. If gene finding is done prior to rRNA finding, or after rRNA finding has failed or was done inaccurately, it is almost certain that spurious ORFs will be found with no apparent overlap to any other feature. It does not appear that any of the sequencing centers for the Human Microbiome Project has a quality assurance checkpoint for making sure that rRNA operons have in fact been found and properly annotated prior to proceeding with gene finding. In addition, one center (JCVI) does not mention eliminating protein coding domains overlapping rRNA operons at all in its SOP. All sequencing centers should ensure that rRNA operons have been found and properly annotated before finding genes and eliminating overlaps, which is to say putative proteins that also code for rRNA.</p>
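<p>A quality assurance checkpoint of this kind could be as simple as the following sketch, which refuses to pass a genome on to gene finding unless each expected rRNA has been annotated with a plausible length. The function name, the input layout and, in particular, the length ranges for Bacterial rRNAs are assumptions chosen for illustration; a production check would need ranges appropriate to the organism's domain of life.</p>
<preformat>
```python
# Illustrative sketch only: the length ranges below are rough assumptions
# for Bacterial rRNAs, not validated limits.
EXPECTED_BACTERIAL_RRNA = {
    "5S": (100, 130),
    "16S": (1400, 1650),
    "23S": (2700, 3100),
}


def rrna_checkpoint(annotated, expected=EXPECTED_BACTERIAL_RRNA):
    """Return a list of problems; an empty list means the check passed.

    annotated is an iterable of (kind, start, end) tuples with 1-based,
    inclusive coordinates, for example ("16S", 1, 1542).
    """
    problems = []
    seen = set()
    for kind, start, end in annotated:
        seen.add(kind)
        length = end - start + 1
        if kind in expected:
            low, high = expected[kind]
            if length not in range(low, high + 1):
                problems.append(
                    f"{kind} at {start}-{end} has unexpected length {length}"
                )
    # Report every expected rRNA that was not annotated at all.
    for kind in sorted(expected):
        if kind not in seen:
            problems.append(f"{kind} not found")
    return problems
```
</preformat>
<p>Gene finding would proceed only when the returned list is empty; a truncated 16S at the end of a contig, for instance, would be reported rather than silently accepted.</p>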
</sec>
<sec>
<title>Getting GenBank corrections to propagate</title>
<p>The case study of PF10695 demonstrates that errors often propagate, but corrections often do not. The corrections least likely to propagate are deletions of erroneous records. Often, the only record of deletion occurs in text comments; there is no file of deletions for easy programmatic handling. It is cumbersome to write programs that scan the entire history of GenBank, parsing text comments to look for deletions and corresponding replacements. GenBank itself apparently does not write such programs. As we showed above, the original GenPept accession for spurious protein CP0987 (AAF38766.1) has been removed as obsolete, but the accession created for it in RefSeq (NP_445524.1) is still active.</p>
</sec>
<sec>
<title>Importance of accurate rRNA operon annotation</title>
<p>Accurate rRNA operon annotation is likely to improve drug discovery and understanding of cellular regulatory processes. There is ample literature describing drug effects on ribosomal metabolism (
<xref ref-type="bibr" rid="B41 B42 B43 B44 B45 B46">41–46</xref>
). Use of this literature certainly requires that all ribosomal subunits be annotated, and effective use requires that the annotations be accurate. In addition, ribosome biosynthesis has long been known to be a major cellular activity, especially in growing cells (
<xref ref-type="bibr" rid="B47">47</xref>
,
<xref ref-type="bibr" rid="B48">48</xref>
). The majority of RNA recovered in metatranscriptomic studies is rRNA, not mRNA. Accurate annotation relating to regulation of ribosome biosynthesis, including promoter locations and binding sites, and further annotation of confirmed antisense proteins such as those found in yeast, is important as well. Future biochemical studies may indeed find that some of the antisense proteins overlapping rRNA sequence in Bacteria and Archaea are in fact expressed and translated, as they are in Eukaryotes. However, this mere potential is no reason to reverse the longstanding, prudent practice of eliminating putative ORFs that overlap Bacterial and Archaeal rRNA sequences and have no confirmed wet or dry lab evidence for their existence, other than their length being over 50–100 codons.</p>
</sec>
</sec>
<sec>
<title>CONCLUSION</title>
<p>Widespread misannotation of spurious ORFs in Bacterial and Archaeal rRNA operons and the existence of Eukaryotic proteins with homology to rRNA combine to create the potential for a false positive rate of 90% in metatranscriptomic studies. Standard Operating Procedures for major sequencing centers should be amended to include a quality assurance checkpoint verifying that rRNA operons of appropriate length have been found before gene finding and elimination of overlapping ORFs proceed, and the JCVI SOP should include elimination of spurious ORFs within Bacterial and Archaeal rRNA operons. Pipelines that do not make use of the ‘search_for_rnas’ program would be improved by adapting it, especially for draft genomes, instead of relying completely on RNAmmer for all rRNA finding.</p>
<p>The spurious protein family PF10695, whose seed sequences are all misannotations, will be deleted from Pfam in release 26.0. All CDS annotations referring to this protein family (1780 in NCBI alone) need to be deleted from public databases. In addition, all Bacterial and Archaeal proteins whose nucleotide sequences have a significant match to known rRNA sequences need to be deleted. NCBI might consider providing monthly files of deleted proteins to assist in propagating these corrections and indeed all corrections involving deletion of spurious putative protein sequences.</p>
<p>Until the public databases are purged of spurious Bacterial and Archaeal proteins within rRNA operons, metatranscriptomic researchers need to be cognizant of the strong potential for false positives stemming from a failure to completely remove all rRNA from their analysis pipelines prior to translating the putative rRNA and querying a ‘trusted’ protein database. The ‘trusted’ protein database can be queried with pseudoreads of rRNA in order to reveal the thousands of misannotations it will undoubtedly have until rRNA annotation and curation procedures are improved.</p>
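<p>Researchers wishing to probe a protein database in this way need a set of pseudoreads. The sketch below draws fixed-length fragments uniformly at random from a known rRNA sequence; the function name, the uniform sampling, the fixed read length and the seed are all assumptions made for illustration and do not reproduce the pseudoread procedure used in this study.</p>
<preformat>
```python
# Illustrative sketch only: uniform sampling of fixed-length fragments is
# an assumption, not the method used to build the pseudoreads in the text.
import random


def make_pseudoreads(rrna_seq, n_reads, read_len, seed=0):
    """Return n_reads substrings of length read_len drawn from rrna_seq."""
    if read_len > len(rrna_seq):
        raise ValueError("read_len exceeds the length of the rRNA sequence")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    last_start = len(rrna_seq) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # both bounds are inclusive
        reads.append(rrna_seq[start:start + read_len])
    return reads
```
</preformat>
<p>The resulting fragments can then be translated and searched against the protein database of interest; any hit is by construction a false positive.</p>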
</sec>
<sec>
<title>SUPPLEMENTARY DATA</title>
<p>
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/full/gkr576/DC1">Supplementary Data</ext-link>
are available at NAR Online.</p>
</sec>
<sec>
<title>FUNDING</title>
<p>
<funding-source>The National Science Foundation</funding-source>
(grants
<award-id>EF0424599</award-id>
,
<award-id>OCE0425363</award-id>
);
<funding-source>Gordon and Betty Moore Foundation</funding-source>
(MEGAMER facility grant to
<funding-source>University of California at Santa Cruz, Investigator</funding-source>
grant to J.Z.). Funding for open access charge:
<funding-source>Gordon and Betty Moore Foundation</funding-source>
.</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>Supplementary Data</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_39_20_8792__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_gkr576_StandardOperatingProcedures_Combined.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="vnd.ms-excel" xlink:href="supp_gkr576_SuppDocCombined.xls"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>This article is dedicated to the memory of Dr. Benjamin R. Munson. We thank Torsten Wendav for programming, and Irina Ilikchyan for useful discussions; Stephan Schuster, Lynn Tomsho and Ji Qi for pyrosequencing.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<label>1</label>
<element-citation publication-type="book">
<person-group person-group-type="editor">
<name>
<surname>Roberts</surname>
<given-names>R</given-names>
</name>
</person-group>
<source>Microsomal Particles and Protein Synthesis</source>
<year>1958</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Pergamon Press</publisher-name>
</element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Woese</surname>
<given-names>CR</given-names>
</name>
<name>
<surname>Fox</surname>
<given-names>GE</given-names>
</name>
</person-group>
<article-title>Phylogenetic structure of the prokaryotic domain: the primary kingdoms</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1977</year>
<volume>74</volume>
<fpage>5088</fpage>
<lpage>5090</lpage>
<pub-id pub-id-type="pmid">270744</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dunn</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Studier</surname>
<given-names>FW</given-names>
</name>
</person-group>
<article-title>T7 early RNAs and
<italic>Escherichia coli</italic>
ribosomal RNAs are cut from large precursor RNAs
<italic>in vivo</italic>
by ribonuclease 3</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1973</year>
<volume>70</volume>
<fpage>3296</fpage>
<lpage>3300</lpage>
<pub-id pub-id-type="pmid">4587248</pub-id>
</element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ginsburg</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Steitz</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>The 30S ribosomal precursor RNA from
<italic>Escherichia coli</italic>
. A primary transcript containing 23 S, 16 S, and 5S sequences</article-title>
<source>J. Biol. Chem.</source>
<year>1975</year>
<volume>250</volume>
<fpage>5647</fpage>
<lpage>5654</lpage>
<pub-id pub-id-type="pmid">1095585</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Smitt</surname>
<given-names>WW</given-names>
</name>
<name>
<surname>Vlak</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Schiphof</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rozijn</surname>
<given-names>TH</given-names>
</name>
</person-group>
<article-title>Precursors of ribosomal RNA in yeast nucleus. Biosynthesis and relation to cytoplasmic ribosomal RNA</article-title>
<source>Exp. Cell Res.</source>
<year>1972</year>
<volume>71</volume>
<fpage>33</fpage>
<lpage>40</lpage>
<pub-id pub-id-type="pmid">5025942</pub-id>
</element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Udem</surname>
<given-names>SA</given-names>
</name>
<name>
<surname>Warner</surname>
<given-names>JR</given-names>
</name>
</person-group>
<article-title>The cytoplasmic maturation of a ribosomal precursor ribonucleic acid in yeast</article-title>
<source>J. Biol. Chem.</source>
<year>1973</year>
<volume>248</volume>
<fpage>1412</fpage>
<lpage>1416</lpage>
<pub-id pub-id-type="pmid">4568815</pub-id>
</element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brosius</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Dull</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Noller</surname>
<given-names>HF</given-names>
</name>
</person-group>
<article-title>Complete nucleotide sequence of a 23S ribosomal RNA gene from
<italic>Escherichia coli</italic>
</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1980</year>
<volume>77</volume>
<fpage>201</fpage>
<lpage>204</lpage>
<pub-id pub-id-type="pmid">6153795</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brosius</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Palmer</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Kennedy</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Noller</surname>
<given-names>HF</given-names>
</name>
</person-group>
<article-title>Complete nucleotide sequence of a 16S ribosomal RNA gene from
<italic>Escherichia coli</italic>
</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1978</year>
<volume>75</volume>
<fpage>4801</fpage>
<lpage>4805</lpage>
<pub-id pub-id-type="pmid">368799</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brownlee</surname>
<given-names>GG</given-names>
</name>
<name>
<surname>Sanger</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Barrell</surname>
<given-names>BG</given-names>
</name>
</person-group>
<article-title>Nucleotide sequence of 5S-ribosomal RNA from
<italic>Escherichia coli</italic>
</article-title>
<source>Nature</source>
<year>1967</year>
<volume>215</volume>
<fpage>735</fpage>
<lpage>736</lpage>
<pub-id pub-id-type="pmid">4862513</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carbon</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ehresmann</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Ehresmann</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ebel</surname>
<given-names>JP</given-names>
</name>
</person-group>
<article-title>The sequence of
<italic>Escherichia coli</italic>
ribosomal 16 S RNA determined by new rapid gel methods</article-title>
<source>FEBS Lett.</source>
<year>1978</year>
<volume>94</volume>
<fpage>152</fpage>
<lpage>156</lpage>
<pub-id pub-id-type="pmid">359355</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Georgiev</surname>
<given-names>OI</given-names>
</name>
<name>
<surname>Nikolaev</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Hadjiolov</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Skryabin</surname>
<given-names>KG</given-names>
</name>
<name>
<surname>Zakharyev</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Bayev</surname>
<given-names>AA</given-names>
</name>
</person-group>
<article-title>The structure of the yeast ribosomal RNA genes. 4. Complete sequence of the 25 S rRNA gene from
<italic>Saccharomyces cerevisae</italic>
</article-title>
<source>Nucleic Acids Res.</source>
<year>1981</year>
<volume>9</volume>
<fpage>6953</fpage>
<lpage>6958</lpage>
<pub-id pub-id-type="pmid">6460984</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hindley</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Page</surname>
<given-names>SM</given-names>
</name>
</person-group>
<article-title>Nucleotide sequence of yeast 5S ribosomal RNA</article-title>
<source>FEBS Lett.</source>
<year>1972</year>
<volume>26</volume>
<fpage>157</fpage>
<lpage>160</lpage>
<pub-id pub-id-type="pmid">4636724</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rubin</surname>
<given-names>GM</given-names>
</name>
</person-group>
<article-title>The nucleotide sequence of
<italic>Saccharomyces cerevisae</italic>
5.8 S ribosomal ribonucleic acid</article-title>
<source>J. Biol. Chem.</source>
<year>1973</year>
<volume>248</volume>
<fpage>3860</fpage>
<lpage>3875</lpage>
<pub-id pub-id-type="pmid">4575197</pub-id>
</element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rubtsov</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Musakhanov</surname>
<given-names>MM</given-names>
</name>
<name>
<surname>Zakharyev</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Krayev</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Skryabin</surname>
<given-names>KG</given-names>
</name>
<name>
<surname>Bayev</surname>
<given-names>AA</given-names>
</name>
</person-group>
<article-title>The structure of the yeast ribosomal RNA genes. I. The complete nucleotide sequence of the 18S ribosomal RNA gene from
<italic>Saccharomyces cerevisiae</italic>
</article-title>
<source>Nucleic Acids Res.</source>
<year>1980</year>
<volume>8</volume>
<fpage>5779</fpage>
<lpage>5794</lpage>
<pub-id pub-id-type="pmid">7008030</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tenson</surname>
<given-names>T</given-names>
</name>
<name>
<surname>DeBlasio</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mankin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>A functional peptide encoded in the
<italic>Escherichia coli</italic>
23S rRNA</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1996</year>
<volume>93</volume>
<fpage>5641</fpage>
<lpage>5646</lpage>
<pub-id pub-id-type="pmid">8643630</pub-id>
</element-citation>
</ref>
<ref id="B16">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mitschke</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Georg</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Dienst</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bantscheff</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Voss</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Steglich</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Wilde</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Vogel</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>An experimentally anchored map of transcriptional start sites in the model cyanobacterium
<italic>Synechocystis</italic>
sp. PCC6803</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2011</year>
<volume>108</volume>
<fpage>2124</fpage>
<lpage>2129</lpage>
</element-citation>
</ref>
<ref id="B17">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aziz</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Bartels</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Best</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>DeJongh</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Disz</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Formsma</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Gerdes</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Glass</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Kubal</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The RAST Server: rapid annotations using subsystems technology</article-title>
<source>BMC Genomics</source>
<year>2008</year>
<volume>9</volume>
<fpage>75</fpage>
<pub-id pub-id-type="pmid">18261238</pub-id>
</element-citation>
</ref>
<ref id="B18">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Coelho</surname>
<given-names>PS</given-names>
</name>
<name>
<surname>Bryan</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Shadel</surname>
<given-names>GS</given-names>
</name>
<name>
<surname>Snyder</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>A novel mitochondrial protein, Tar1p, is encoded on the antisense strand of the nuclear 25S rDNA</article-title>
<source>Genes Dev.</source>
<year>2002</year>
<volume>16</volume>
<fpage>2755</fpage>
<lpage>2760</lpage>
<pub-id pub-id-type="pmid">12414727</pub-id>
</element-citation>
</ref>
<ref id="B19">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mauro</surname>
<given-names>VP</given-names>
</name>
<name>
<surname>Edelman</surname>
<given-names>GM</given-names>
</name>
</person-group>
<article-title>rRNA-like sequences occur in diverse primary transcripts: implications for the control of gene expression</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>1997</year>
<volume>94</volume>
<fpage>422</fpage>
<lpage>427</lpage>
<pub-id pub-id-type="pmid">9012798</pub-id>
</element-citation>
</ref>
<ref id="B20">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chooi</surname>
<given-names>WY</given-names>
</name>
<name>
<surname>Leiby</surname>
<given-names>KR</given-names>
</name>
</person-group>
<article-title>The
<italic>in vivo</italic>
expression of pseudo ribosomal RNA genes in
<italic>Drosophila melanogaster</italic>
</article-title>
<source>Mol. Gen. Genet.</source>
<year>1981</year>
<volume>182</volume>
<fpage>245</fpage>
<lpage>251</lpage>
<pub-id pub-id-type="pmid">6793808</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kermekchiev</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Ribin, a protein encoded by a message complementary to rRNA, modulates ribosomal transcription and cell proliferation</article-title>
<source>Mol. Cell Biol.</source>
<year>2001</year>
<volume>21</volume>
<fpage>8255</fpage>
<lpage>8263</lpage>
<pub-id pub-id-type="pmid">11713263</pub-id>
</element-citation>
</ref>
<ref id="B22">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scharf</surname>
<given-names>ME</given-names>
</name>
<name>
<surname>Wu-Scharf</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Pittendrigh</surname>
<given-names>BR</given-names>
</name>
<name>
<surname>Bennett</surname>
<given-names>GW</given-names>
</name>
</person-group>
<article-title>Gene expression profiles among immature and adult reproductive castes of the termite
<italic>Reticulitermes flavipes</italic>
</article-title>
<source>Insect Mol. Biol.</source>
<year>2005</year>
<volume>14</volume>
<fpage>31</fpage>
<lpage>44</lpage>
<pub-id pub-id-type="pmid">15663773</pub-id>
</element-citation>
</ref>
<ref id="B23">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Finn</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Mistry</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tate</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Coggill</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Heger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pollington</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Gavin</surname>
<given-names>OL</given-names>
</name>
<name>
<surname>Gunasekaran</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ceric</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Forslund</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Pfam protein families database</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>D211</fpage>
<lpage>D222</lpage>
<pub-id pub-id-type="pmid">19920124</pub-id>
</element-citation>
</ref>
<ref id="B24">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shi</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tyson</surname>
<given-names>GW</given-names>
</name>
<name>
<surname>DeLong</surname>
<given-names>EF</given-names>
</name>
</person-group>
<article-title>Metatranscriptomics reveals unique microbial small RNAs in the ocean's water column</article-title>
<source>Nature</source>
<year>2009</year>
<volume>459</volume>
<fpage>266</fpage>
<lpage>269</lpage>
<pub-id pub-id-type="pmid">19444216</pub-id>
</element-citation>
</ref>
<ref id="B25">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Altintas</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Peltier</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Stocks</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Allen</surname>
<given-names>EE</given-names>
</name>
<name>
<surname>Ellisman</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Grethe</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource</article-title>
<source>Nucleic Acids Res.</source>
<year>2011</year>
<volume>39</volume>
<fpage>D546</fpage>
<lpage>D551</lpage>
<pub-id pub-id-type="pmid">21045053</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rutherford</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Parkhill</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Crook</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Horsnell</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Rice</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rajandream</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Barrell</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Artemis: sequence visualization and annotation</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<fpage>944</fpage>
<lpage>945</lpage>
<pub-id pub-id-type="pmid">11120685</pub-id>
</element-citation>
</ref>
<ref id="B27">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Markowitz</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>IM</given-names>
</name>
<name>
<surname>Palaniappan</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Szeto</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Grechkin</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Ratner</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lykidis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The integrated microbial genomes system: an expanding comparative analysis resource</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>D382</fpage>
<lpage>D390</lpage>
<pub-id pub-id-type="pmid">19864254</pub-id>
</element-citation>
</ref>
<ref id="B28">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pruesse</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Quast</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Knittel</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Fuchs</surname>
<given-names>BM</given-names>
</name>
<name>
<surname>Ludwig</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Peplies</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Glockner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB</article-title>
<source>Nucleic Acids Res.</source>
<year>2007</year>
<volume>35</volume>
<fpage>7188</fpage>
<lpage>7196</lpage>
<pub-id pub-id-type="pmid">17947321</pub-id>
</element-citation>
</ref>
<ref id="B29">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huson</surname>
<given-names>DH</given-names>
</name>
<name>
<surname>Auch</surname>
<given-names>AF</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Schuster</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>MEGAN analysis of metagenomic data</article-title>
<source>Genome Res.</source>
<year>2007</year>
<volume>17</volume>
<fpage>377</fpage>
<lpage>386</lpage>
<pub-id pub-id-type="pmid">17255551</pub-id>
</element-citation>
</ref>
<ref id="B30">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lagesen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Hallin</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rodland</surname>
<given-names>EA</given-names>
</name>
<name>
<surname>Staerfeldt</surname>
<given-names>HH</given-names>
</name>
<name>
<surname>Rognes</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Ussery</surname>
<given-names>DW</given-names>
</name>
</person-group>
<article-title>RNAmmer: consistent and rapid annotation of ribosomal RNA genes</article-title>
<source>Nucleic Acids Res.</source>
<year>2007</year>
<volume>35</volume>
<fpage>3100</fpage>
<lpage>3108</lpage>
<pub-id pub-id-type="pmid">17452365</pub-id>
</element-citation>
</ref>
<ref id="B31">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liolios</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>IM</given-names>
</name>
<name>
<surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Tavernarakis</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Markowitz</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Kyrpides</surname>
<given-names>NC</given-names>
</name>
</person-group>
<article-title>The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata</article-title>
<source>Nucleic Acids Res.</source>
<year>2009</year>
<volume>38</volume>
<fpage>D346</fpage>
<lpage>D354</lpage>
<pub-id pub-id-type="pmid">19914934</pub-id>
</element-citation>
</ref>
<ref id="B32">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kawarabayasi</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Sawada</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Horikawa</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Haikawa</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hino</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Yamamoto</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sekine</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Baba</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kosugi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Hosoyama</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Complete sequence and gene organization of the genome of a hyper-thermophilic archaebacterium,
<italic>Pyrococcus horikoshii</italic>
OT3</article-title>
<source>DNA Res.</source>
<year>1998</year>
<volume>5</volume>
<fpage>55</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="pmid">9679194</pub-id>
</element-citation>
</ref>
<ref id="B33">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kunst</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ogasawara</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Moszer</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Albertini</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Alloni</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Azevedo</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Bertero</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Bessieres</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bolotin</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Borchert</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The complete genome sequence of the gram-positive bacterium
<italic>Bacillus subtilis</italic>
</article-title>
<source>Nature</source>
<year>1997</year>
<volume>390</volume>
<fpage>249</fpage>
<lpage>256</lpage>
<pub-id pub-id-type="pmid">9384377</pub-id>
</element-citation>
</ref>
<ref id="B34">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medigue</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Moszer</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Viari</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Danchin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Analysis of a
<italic>Bacillus subtilis</italic>
genome fragment using a co-operative computer system prototype</article-title>
<source>Gene</source>
<year>1995</year>
<volume>165</volume>
<fpage>GC37</fpage>
<lpage>GC51</lpage>
<pub-id pub-id-type="pmid">7489895</pub-id>
</element-citation>
</ref>
<ref id="B35">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Medigue</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Rouxel</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Vigier</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Henaut</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Danchin</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Evidence for horizontal gene transfer in
<italic>Escherichia coli</italic>
speciation</article-title>
<source>J. Mol. Biol.</source>
<year>1991</year>
<volume>222</volume>
<fpage>851</fpage>
<lpage>856</lpage>
<pub-id pub-id-type="pmid">1762151</pub-id>
</element-citation>
</ref>
<ref id="B36">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Staden</surname>
<given-names>R</given-names>
</name>
<name>
<surname>McLachlan</surname>
<given-names>AD</given-names>
</name>
</person-group>
<article-title>Codon preference and its use in identifying protein coding regions in long DNA sequences</article-title>
<source>Nucleic Acids Res.</source>
<year>1982</year>
<volume>10</volume>
<fpage>141</fpage>
<lpage>156</lpage>
<pub-id pub-id-type="pmid">7063399</pub-id>
</element-citation>
</ref>
<ref id="B37">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krogh</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mian</surname>
<given-names>IS</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>A hidden Markov model that finds genes in
<italic>E. coli</italic>
DNA</article-title>
<source>Nucleic Acids Res.</source>
<year>1994</year>
<volume>22</volume>
<fpage>4768</fpage>
<lpage>4778</lpage>
<pub-id pub-id-type="pmid">7984429</pub-id>
</element-citation>
</ref>
<ref id="B38">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Overbeek</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Begley</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Butler</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Choudhuri</surname>
<given-names>JV</given-names>
</name>
<name>
<surname>Chuang</surname>
<given-names>HY</given-names>
</name>
<name>
<surname>Cohoon</surname>
<given-names>M</given-names>
</name>
<name>
<surname>de Crecy-Lagard</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Diaz</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Disz</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>R</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes</article-title>
<source>Nucleic Acids Res.</source>
<year>2005</year>
<volume>33</volume>
<fpage>5691</fpage>
<lpage>5702</lpage>
<pub-id pub-id-type="pmid">16214803</pub-id>
</element-citation>
</ref>
<ref id="B39">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bakke</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Carney</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Deloache</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Gearing</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ingvorsen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Lotz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>McNair</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Penumetcha</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Voss</surname>
<given-names>L</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Evaluation of three automated genome annotations for
<italic>Halorhabdus utahensis</italic>
</article-title>
<source>PLoS ONE</source>
<year>2009</year>
<volume>4</volume>
<fpage>e6291</fpage>
<pub-id pub-id-type="pmid">19617911</pub-id>
</element-citation>
</ref>
<ref id="B40">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Markowitz</surname>
<given-names>VM</given-names>
</name>
<name>
<surname>Mavromatis</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ivanova</surname>
<given-names>NN</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>IM</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kyrpides</surname>
<given-names>NC</given-names>
</name>
</person-group>
<article-title>IMG ER: a system for microbial genome annotation expert review and curation</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>2271</fpage>
<lpage>2278</lpage>
<pub-id pub-id-type="pmid">19561336</pub-id>
</element-citation>
</ref>
<ref id="B41">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scheunemann</surname>
<given-names>AE</given-names>
</name>
<name>
<surname>Graham</surname>
<given-names>WD</given-names>
</name>
<name>
<surname>Vendeix</surname>
<given-names>FA</given-names>
</name>
<name>
<surname>Agris</surname>
<given-names>PF</given-names>
</name>
</person-group>
<article-title>Binding of aminoglycoside antibiotics to helix 69 of 23S rRNA</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>3094</fpage>
<lpage>3105</lpage>
<pub-id pub-id-type="pmid">20110260</pub-id>
</element-citation>
</ref>
<ref id="B42">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maguire</surname>
<given-names>BA</given-names>
</name>
</person-group>
<article-title>Inhibition of bacterial ribosome assembly: a suitable drug target?</article-title>
<source>Microbiol. Mol. Biol. Rev.</source>
<year>2009</year>
<volume>73</volume>
<fpage>22</fpage>
<lpage>35</lpage>
<pub-id pub-id-type="pmid">19258531</pub-id>
</element-citation>
</ref>
<ref id="B43">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carter</surname>
<given-names>AP</given-names>
</name>
<name>
<surname>Clemons</surname>
<given-names>WM</given-names>
</name>
<name>
<surname>Brodersen</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Morgan-Warren</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Wimberly</surname>
<given-names>BT</given-names>
</name>
<name>
<surname>Ramakrishnan</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Functional insights from the structure of the 30S ribosomal subunit and its interactions with antibiotics</article-title>
<source>Nature</source>
<year>2000</year>
<volume>407</volume>
<fpage>340</fpage>
<lpage>348</lpage>
<pub-id pub-id-type="pmid">11014183</pub-id>
</element-citation>
</ref>
<ref id="B44">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mehta</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Champney</surname>
<given-names>WS</given-names>
</name>
</person-group>
<article-title>30S ribosomal subunit assembly is a target for inhibition by aminoglycosides in
<italic>Escherichia coli</italic>
</article-title>
<source>Antimicrob. Agents Chemother.</source>
<year>2002</year>
<volume>46</volume>
<fpage>1546</fpage>
<lpage>1549</lpage>
<pub-id pub-id-type="pmid">11959595</pub-id>
</element-citation>
</ref>
<ref id="B45">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>David-Eden</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Mankin</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Mandel-Gutfreund</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Structural signatures of antibiotic binding sites on the ribosome</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>5982</fpage>
<lpage>5994</lpage>
<pub-id pub-id-type="pmid">20494981</pub-id>
</element-citation>
</ref>
<ref id="B46">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Duc</surname>
<given-names>AC</given-names>
</name>
<name>
<surname>Klosi</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Pattabiraman</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Spaller</surname>
<given-names>MR</given-names>
</name>
<name>
<surname>Chow</surname>
<given-names>CS</given-names>
</name>
</person-group>
<article-title>Selection of peptides that target the aminoacyl-tRNA site of bacterial 16S ribosomal RNA</article-title>
<source>Biochemistry</source>
<year>2009</year>
<volume>48</volume>
<fpage>8299</fpage>
<lpage>8311</lpage>
<pub-id pub-id-type="pmid">19645415</pub-id>
</element-citation>
</ref>
<ref id="B47">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Warner</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Vilardell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sohn</surname>
<given-names>JH</given-names>
</name>
</person-group>
<article-title>Economics of ribosome biosynthesis</article-title>
<source>Cold Spring Harb. Symp. Quant. Biol.</source>
<year>2001</year>
<volume>66</volume>
<fpage>567</fpage>
<lpage>574</lpage>
<pub-id pub-id-type="pmid">12762058</pub-id>
</element-citation>
</ref>
<ref id="B48">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kjeldgaard</surname>
<given-names>NO</given-names>
</name>
<name>
<surname>Gausing</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Regulation of biosynthesis of ribosomes</article-title>
<source>Cold Spring Harb. Monogr. Arch.</source>
<year>1974</year>
<volume>4</volume>
<fpage>369</fpage>
<lpage>392</lpage>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000587 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000587 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3203614
   |texte=   Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:21771858" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CyberinfraV1 

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024