Exploration server on telematics

BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences

Internal identifier: 000505 (Pmc/Corpus); previous: 000504; next: 000506

Authors: Damiano Piovesan; Pier Luigi Martelli; Piero Fariselli; Andrea Zauli; Ivan Rossi; Rita Casadio

RBID : PMC:3125743

Abstract

We introduce BAR-PLUS (BAR+), a web server for functional and structural annotation of protein sequences. BAR+ is based on a large-scale cross-genome comparison and a non-hierarchical clustering procedure characterized by a metric that ensures a reliable transfer of features within clusters. In this version, the method takes advantage of a large-scale pairwise sequence comparison of 13 495 736 protein chains, including those of 988 complete proteomes. Available sequence annotation is derived from UniProtKB, GO, Pfam and PDB. When PDB templates are present within a cluster (with or without their SCOP classification), profile Hidden Markov Models (HMMs) are computed from sequence-to-structure alignments and are associated with the cluster (Cluster-HMM). The resulting library of 10 858 HMMs is made available for aligning even distantly related sequences for structural modelling. The server also provides pairwise query sequence–structural target alignments computed from the corresponding Cluster-HMM. BAR+ in its present version allows three main categories of annotation, plus no annotation: PDB [with or without SCOP (*)] with GO and/or Pfam; PDB (*) without GO and/or Pfam; and GO and/or Pfam without PDB (*). Each category can further comprise clusters where GO and Pfam functional annotations are or are not statistically significant. BAR+ is available at http://bar.biocomp.unibo.it/bar2.0.


DOI: 10.1093/nar/gkr292
PubMed: 21622657
PubMed Central: 3125743

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences</title>
<author>
<name sortKey="Piovesan, Damiano" sort="Piovesan, Damiano" uniqKey="Piovesan D" first="Damiano" last="Piovesan">Damiano Piovesan</name>
<affiliation>
<nlm:aff id="AFF1">Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Luigi Martelli, Pier" sort="Luigi Martelli, Pier" uniqKey="Luigi Martelli P" first="Pier" last="Luigi Martelli">Pier Luigi Martelli</name>
<affiliation>
<nlm:aff id="AFF1">Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fariselli, Piero" sort="Fariselli, Piero" uniqKey="Fariselli P" first="Piero" last="Fariselli">Piero Fariselli</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Computer Science, University of Bologna, Bologna</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zauli, Andrea" sort="Zauli, Andrea" uniqKey="Zauli A" first="Andrea" last="Zauli">Andrea Zauli</name>
<affiliation>
<nlm:aff id="AFF1">BioDec srl, Bologna, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rossi, Ivan" sort="Rossi, Ivan" uniqKey="Rossi I" first="Ivan" last="Rossi">Ivan Rossi</name>
<affiliation>
<nlm:aff id="AFF1">BioDec srl, Bologna, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Casadio, Rita" sort="Casadio, Rita" uniqKey="Casadio R" first="Rita" last="Casadio">Rita Casadio</name>
<affiliation>
<nlm:aff id="AFF1">Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21622657</idno>
<idno type="pmc">3125743</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125743</idno>
<idno type="RBID">PMC:3125743</idno>
<idno type="doi">10.1093/nar/gkr292</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000505</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000505</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences</title>
<author>
<name sortKey="Piovesan, Damiano" sort="Piovesan, Damiano" uniqKey="Piovesan D" first="Damiano" last="Piovesan">Damiano Piovesan</name>
<affiliation>
<nlm:aff id="AFF1">Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Luigi Martelli, Pier" sort="Luigi Martelli, Pier" uniqKey="Luigi Martelli P" first="Pier" last="Luigi Martelli">Pier Luigi Martelli</name>
<affiliation>
<nlm:aff id="AFF1">Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fariselli, Piero" sort="Fariselli, Piero" uniqKey="Fariselli P" first="Piero" last="Fariselli">Piero Fariselli</name>
<affiliation>
<nlm:aff wicri:cut=" and" id="AFF1">Department of Computer Science, University of Bologna, Bologna</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zauli, Andrea" sort="Zauli, Andrea" uniqKey="Zauli A" first="Andrea" last="Zauli">Andrea Zauli</name>
<affiliation>
<nlm:aff id="AFF1">BioDec srl, Bologna, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rossi, Ivan" sort="Rossi, Ivan" uniqKey="Rossi I" first="Ivan" last="Rossi">Ivan Rossi</name>
<affiliation>
<nlm:aff id="AFF1">BioDec srl, Bologna, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Casadio, Rita" sort="Casadio, Rita" uniqKey="Casadio R" first="Rita" last="Casadio">Rita Casadio</name>
<affiliation>
<nlm:aff id="AFF1">Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We introduce BAR-PLUS (BAR
<sup>+</sup>
), a web server for functional and structural annotation of protein sequences. BAR
<sup>+</sup>
is based on a large-scale cross-genome comparison and a non-hierarchical clustering procedure characterized by a metric that ensures a reliable transfer of features within clusters. In this version, the method takes advantage of a large-scale pairwise sequence comparison of 13 495 736 protein chains, including those of 988 complete proteomes. Available sequence annotation is derived from UniProtKB, GO, Pfam and PDB. When PDB templates are present within a cluster (with or without their SCOP classification), profile Hidden Markov Models (HMMs) are computed from sequence-to-structure alignments and are associated with the cluster (Cluster-HMM). The resulting library of 10 858 HMMs is made available for aligning even distantly related sequences for structural modelling. The server also provides pairwise query sequence–structural target alignments computed from the corresponding Cluster-HMM. BAR
<sup>+</sup>
in its present version allows three main categories of annotation, plus no annotation: PDB [with or without SCOP (*)] with GO and/or Pfam; PDB (*) without GO and/or Pfam; and GO and/or Pfam without PDB (*). Each category can further comprise clusters where GO and Pfam functional annotations are or are not statistically significant. BAR
<sup>+</sup>
is available at
<ext-link ext-link-type="uri" xlink:href="http://bar.biocomp.unibo.it/bar2.0">http://bar.biocomp.unibo.it/bar2.0</ext-link>
.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Krause, A" uniqKey="Krause A">A Krause</name>
</author>
<author>
<name sortKey="Stoye, J" uniqKey="Stoye J">J Stoye</name>
</author>
<author>
<name sortKey="Vingron, M" uniqKey="Vingron M">M Vingron</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heger, A" uniqKey="Heger A">A Heger</name>
</author>
<author>
<name sortKey="Holm, L" uniqKey="Holm L">L Holm</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, Ch" uniqKey="Wu C">CH Wu</name>
</author>
<author>
<name sortKey="Huang, H" uniqKey="Huang H">H Huang</name>
</author>
<author>
<name sortKey="Nikolskaya, A" uniqKey="Nikolskaya A">A Nikolskaya</name>
</author>
<author>
<name sortKey="Hu, Z" uniqKey="Hu Z">Z Hu</name>
</author>
<author>
<name sortKey="Barker, Wc" uniqKey="Barker W">WC Barker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kriventseva, Ev" uniqKey="Kriventseva E">EV Kriventseva</name>
</author>
<author>
<name sortKey="Fleischmann, W" uniqKey="Fleischmann W">W Fleischmann</name>
</author>
<author>
<name sortKey="Zdobnov, Em" uniqKey="Zdobnov E">EM Zdobnov</name>
</author>
<author>
<name sortKey="Apweiler, R" uniqKey="Apweiler R">R Apweiler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Petryszak, R" uniqKey="Petryszak R">R Petryszak</name>
</author>
<author>
<name sortKey="Kretschmann, E" uniqKey="Kretschmann E">E Kretschmann</name>
</author>
<author>
<name sortKey="Wieser, D" uniqKey="Wieser D">D Wieser</name>
</author>
<author>
<name sortKey="Apweiler, R" uniqKey="Apweiler R">R Apweiler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kaplan, N" uniqKey="Kaplan N">N Kaplan</name>
</author>
<author>
<name sortKey="Sasson, O" uniqKey="Sasson O">O Sasson</name>
</author>
<author>
<name sortKey="Inbar, U" uniqKey="Inbar U">U Inbar</name>
</author>
<author>
<name sortKey="Friedlich, M" uniqKey="Friedlich M">M Friedlich</name>
</author>
<author>
<name sortKey="Fromer, M" uniqKey="Fromer M">M Fromer</name>
</author>
<author>
<name sortKey="Fleischer, H" uniqKey="Fleischer H">H Fleischer</name>
</author>
<author>
<name sortKey="Portugaly, E" uniqKey="Portugaly E">E Portugaly</name>
</author>
<author>
<name sortKey="Linial, N" uniqKey="Linial N">N Linial</name>
</author>
<author>
<name sortKey="Linial, M" uniqKey="Linial M">M Linial</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Loewenstein, Y" uniqKey="Loewenstein Y">Y Loewenstein</name>
</author>
<author>
<name sortKey="Portugaly, E" uniqKey="Portugaly E">E Portugaly</name>
</author>
<author>
<name sortKey="Fromer, M" uniqKey="Fromer M">M Fromer</name>
</author>
<author>
<name sortKey="Linial, M" uniqKey="Linial M">M Linial</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sperisen, P" uniqKey="Sperisen P">P Sperisen</name>
</author>
<author>
<name sortKey="Pagni, M" uniqKey="Pagni M">M Pagni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Enright, Aj" uniqKey="Enright A">AJ Enright</name>
</author>
<author>
<name sortKey="Van Dongen, S" uniqKey="Van Dongen S">S Van Dongen</name>
</author>
<author>
<name sortKey="Ouzounis, Ca" uniqKey="Ouzounis C">CA Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cuff, Al" uniqKey="Cuff A">AL Cuff</name>
</author>
<author>
<name sortKey="Sillitoe, I" uniqKey="Sillitoe I">I Sillitoe</name>
</author>
<author>
<name sortKey="Lewis, T" uniqKey="Lewis T">T Lewis</name>
</author>
<author>
<name sortKey="Clegg, Ab" uniqKey="Clegg A">AB Clegg</name>
</author>
<author>
<name sortKey="Rentzsch, R" uniqKey="Rentzsch R">R Rentzsch</name>
</author>
<author>
<name sortKey="Furnham, N" uniqKey="Furnham N">N Furnham</name>
</author>
<author>
<name sortKey="Pellegrini Calace, M" uniqKey="Pellegrini Calace M">M Pellegrini-Calace</name>
</author>
<author>
<name sortKey="Jones, D" uniqKey="Jones D">D Jones</name>
</author>
<author>
<name sortKey="Thornton, J" uniqKey="Thornton J">J Thornton</name>
</author>
<author>
<name sortKey="Orengo, Ca" uniqKey="Orengo C">CA Orengo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bartoli, L" uniqKey="Bartoli L">L Bartoli</name>
</author>
<author>
<name sortKey="Montanucci, L" uniqKey="Montanucci L">L Montanucci</name>
</author>
<author>
<name sortKey="Fronza, R" uniqKey="Fronza R">R Fronza</name>
</author>
<author>
<name sortKey="Martelli, Pl" uniqKey="Martelli P">PL Martelli</name>
</author>
<author>
<name sortKey="Fariselli, P" uniqKey="Fariselli P">P Fariselli</name>
</author>
<author>
<name sortKey="Carota, L" uniqKey="Carota L">L Carota</name>
</author>
<author>
<name sortKey="Donvito, G" uniqKey="Donvito G">G Donvito</name>
</author>
<author>
<name sortKey="Maggi, G" uniqKey="Maggi G">G Maggi</name>
</author>
<author>
<name sortKey="Casadio, R" uniqKey="Casadio R">R Casadio</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcginnis, S" uniqKey="Mcginnis S">S McGinnis</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Konagurthu, As" uniqKey="Konagurthu A">AS Konagurthu</name>
</author>
<author>
<name sortKey="Whisstock, Jc" uniqKey="Whisstock J">JC Whisstock</name>
</author>
<author>
<name sortKey="Stuckey, Pj" uniqKey="Stuckey P">PJ Stuckey</name>
</author>
<author>
<name sortKey="Lesk, Al" uniqKey="Lesk A">AL Lesk</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Eddy, Sr" uniqKey="Eddy S">SR Eddy</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21622657</article-id>
<article-id pub-id-type="pmc">3125743</article-id>
<article-id pub-id-type="doi">10.1093/nar/gkr292</article-id>
<article-id pub-id-type="publisher-id">gkr292</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Piovesan</surname>
<given-names>Damiano</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Martelli</surname>
<given-names>Pier Luigi</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fariselli</surname>
<given-names>Piero</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zauli</surname>
<given-names>Andrea</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Rossi</surname>
<given-names>Ivan</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Casadio</surname>
<given-names>Rita</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
</contrib-group>
<aff id="AFF1">
<sup>1</sup>
Department of Biology, Bologna Biocomputing Group, Bologna Computational Biology Network,
<sup>2</sup>
Department of Computer Science, University of Bologna, Bologna and
<sup>3</sup>
BioDec srl, Bologna, Italy</aff>
<author-notes>
<corresp id="COR1">*To whom correspondence should be addressed. Tel:
<phone>+39 0512094005</phone>
; Fax:
<fax>+39 0512094005</fax>
; Email:
<email>casadio@biocomp.unibo.it</email>
</corresp>
</author-notes>
<pmc-comment>For NAR both ppub and collection dates generated for PMC processing 1/27/05 beck</pmc-comment>
<pub-date pub-type="collection">
<day>1</day>
<month>7</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="ppub">
<day>1</day>
<month>7</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>26</day>
<month>5</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>26</day>
<month>5</month>
<year>2011</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>39</volume>
<issue>Web Server issue</issue>
<issue-title>Web Server issue</issue-title>
<fpage>W197</fpage>
<lpage>W202</lpage>
<history>
<date date-type="received">
<day>4</day>
<month>2</month>
<year>2011</year>
</date>
<date date-type="rev-recd">
<day>4</day>
<month>4</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>13</day>
<month>4</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2011. Published by Oxford University Press.</copyright-statement>
<copyright-year>2011</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>We introduce BAR-PLUS (BAR
<sup>+</sup>
), a web server for functional and structural annotation of protein sequences. BAR
<sup>+</sup>
is based on a large-scale cross-genome comparison and a non-hierarchical clustering procedure characterized by a metric that ensures a reliable transfer of features within clusters. In this version, the method takes advantage of a large-scale pairwise sequence comparison of 13 495 736 protein chains, including those of 988 complete proteomes. Available sequence annotation is derived from UniProtKB, GO, Pfam and PDB. When PDB templates are present within a cluster (with or without their SCOP classification), profile Hidden Markov Models (HMMs) are computed from sequence-to-structure alignments and are associated with the cluster (Cluster-HMM). The resulting library of 10 858 HMMs is made available for aligning even distantly related sequences for structural modelling. The server also provides pairwise query sequence–structural target alignments computed from the corresponding Cluster-HMM. BAR
<sup>+</sup>
in its present version allows three main categories of annotation, plus no annotation: PDB [with or without SCOP (*)] with GO and/or Pfam; PDB (*) without GO and/or Pfam; and GO and/or Pfam without PDB (*). Each category can further comprise clusters where GO and Pfam functional annotations are or are not statistically significant. BAR
<sup>+</sup>
is available at
<ext-link ext-link-type="uri" xlink:href="http://bar.biocomp.unibo.it/bar2.0">http://bar.biocomp.unibo.it/bar2.0</ext-link>
.</p>
</abstract>
<counts>
<page-count count="6"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="">
<title>INTRODUCTION</title>
<p>In the post-genomic era, with the advent of rapid sequencing techniques, reliable and efficient functional annotation methods are needed. Routinely, a translated protein sequence is aligned against a database of already annotated sequences and is thereby assigned different features depending on the level of sequence identity (SI). This similarity search is the basis for transfer of annotation by homology. The UniProt Knowledgebase (UniProtKB;
<ext-link ext-link-type="uri" xlink:href="http://www.UniProtKB.org/">http://www.UniProtKB.org/</ext-link>
) is presently our major resource of information on protein sequences and on the corresponding functions and structures, when available. It also provides links to other resources and databases, giving a comprehensive view of the experimental and computational characteristics of known and putative proteins and genes. However, of the whole protein universe, which presently (UniProtKB release 2011_03; 8 March 2011) includes some 14 million sequences, only 4.4% has evidence at the protein and at the transcript level. In this scenario, inference of function and structure among related sequences requires the definition of rules to increase the reliability of annotation. This is routinely achieved with clustering methods, which group sequences into sets of similar sequences. Clustering can be hierarchical or non-hierarchical. Hierarchical clustering organizes sequences into a tree structure. Examples of hierarchical clustering include SYSTERS (
<xref ref-type="bibr" rid="B1">1</xref>
), Picasso (
<xref ref-type="bibr" rid="B2">2</xref>
) and iProClass (
<xref ref-type="bibr" rid="B3">3</xref>
). CluSTr (
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B5">5</xref>
) and ProtoNet (
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B7">7</xref>
) are the only web servers that comprise the large number of sequences made available by fully sequenced genomes and the entire UniProtKB. Both CluSTr and ProtoNet cluster sequences according to different levels of SI, as set by different
<italic>E</italic>
-value thresholds, and with different hierarchical algorithms. Alternatively, non-hierarchical clustering partitions a sequence data set into disjoint clusters (
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
). However, neither hierarchical nor non-hierarchical methods explicitly consider proteins containing multiple domains, or the fact that proteins sharing common domains do not necessarily have the same function. Proteins with different combinations of shared domains can have different molecular and biological functions, as recently re-discussed (
<xref ref-type="bibr" rid="B10">10</xref>
). In order to address these problems, we developed BAR (
<xref ref-type="bibr" rid="B11">11</xref>
), an annotation procedure that relies on a non-hierarchical clustering method and a large-scale genome comparison in which pairs of sequences are selected with very strict criteria of similarity and alignment overlap, as described in the next section. We provided statistical validation that BAR allows reliable functional and structural annotation in addition to that given by commonly used databases (
<xref ref-type="bibr" rid="B11">11</xref>
). Here, we introduce BAR
<sup>+</sup>
, an updated and extended version of BAR that includes: (i) a 5-fold increase in sequences; (ii) GO terms from the three main roots (molecular function, biological process and cellular localization;
<ext-link ext-link-type="uri" xlink:href="http://www.geneontology.org/">http://www.geneontology.org/</ext-link>
); (iii) Pfam domains (
<ext-link ext-link-type="uri" xlink:href="http://pfam.sanger.ac.uk/">http://pfam.sanger.ac.uk/</ext-link>
); (iv) known ligands; and (v) for clusters containing PDB structure/s, a Cluster-HMM and the corresponding alignment of the target sequence to the optimal template in the cluster, for computing its 3D structure.</p>
</sec>
<sec>
<title>BAR
<sup>+</sup>
IMPLEMENTATION</title>
<p>BAR
<sup>+</sup>
is constructed by performing an all-against-all pairwise alignment of all protein sequences collected from the entire UniProtKB 05_2010, with the exclusion of fragments (9 399 063 sequences), and from the proteomes of completely sequenced genomes available on the same date at the National Center for Biotechnology Information (NCBI) [
<ext-link ext-link-type="uri" xlink:href="www.ncbi.nlm.nih.gov/genomes/lproks.cgi">www.ncbi.nlm.nih.gov/genomes/lproks.cgi</ext-link>
(Prokaryotes);
<ext-link ext-link-type="uri" xlink:href="www.ncbi.nlm.nih.gov/genomes/leuks.cgi">www.ncbi.nlm.nih.gov/genomes/leuks.cgi</ext-link>
(Eukaryotes)] and at Ensembl (
<ext-link ext-link-type="uri" xlink:href="http://www.ensembl.org/info/data/ftp/index.html">http://www.ensembl.org/info/data/ftp/index.html</ext-link>
), for a total of 988 complete proteomes (the list of species is available at the BAR+ web site). For the sake of comparison, we also used the entire SwissProt 03_2011 (8 March). Similarly to BAR (
<xref ref-type="bibr" rid="B11">11</xref>
), BAR
<sup>+</sup>
is also a non-hierarchical clustering method relying on a comparative large-scale genome analysis, and it is characterized by a stringent metric that ensures a reliable transfer of features within clusters. In this new version, the method takes advantage of a larger-scale pairwise sequence comparison than BAR, including 13 495 736 protein sequences. Alignment is performed with BLAST (
<xref ref-type="bibr" rid="B12">12</xref>
) in a GRID environment (
<xref ref-type="bibr" rid="B11">11</xref>
). From the alignments we compute, for each pair, both the SI and the Coverage (COV), defined as the ratio between the length of the intersection of the aligned regions on the two sequences and the overall length of the alignment (namely, the sum of the lengths of the two sequences minus the intersection length). Each protein is then taken as a node, and a graph is built in which two nodes are linked only when both similarity constraints hold for the two proteins: SI ≥ 40% and COV ≥ 90%. Clusters are then simply the connected components of the graph (
<xref ref-type="bibr" rid="B11">11</xref>
). A workflow of the method is shown in
<xref ref-type="fig" rid="F1">Figure 1</xref>
. Seventy percent of the whole data set (9 401 223 sequences) falls into 913 962 clusters. Notably, 55% of the clusters include 84% of the cluster-included sequences. The number of sequences per cluster ranges from two up to 87 893 in the most populated one (Molecular Function: ABC transporter). Given our stringent criteria, 87% of the clusters contain sequences whose standard deviation (SD) of the protein length is ≤5 residues. The remaining sequences (30% of the total) form singletons (clusters containing just one sequence). Well-annotated sequences are characterized by functional and structural annotations derived from UniProtKB entries (
<xref ref-type="fig" rid="F1">Figure 1</xref>
). These include GO, Pfam, PDB and SCOP (
<ext-link ext-link-type="uri" xlink:href="http://scop.mrc-lmb.cam.ac.uk/scop/">http://scop.mrc-lmb.cam.ac.uk/scop/</ext-link>
) (when available). To assess whether GO and Pfam terms are significant in a cluster, we compute
<italic>P</italic>
-values and, given the multiplicity of the terms, apply the Bonferroni correction (
<xref ref-type="bibr" rid="B11">11</xref>
). We evaluated the cumulative distribution of Bonferroni corrected
<italic>P</italic>
-values by adopting a bootstrapping procedure. From this we set the threshold
<italic>P</italic>
-value at 0.01 in order to discriminate between random and significant (cluster-associated) features (
<xref ref-type="bibr" rid="B11">11</xref>
). Validated features (significant for the cluster) are those endowed with
<italic>P</italic>
 ≤ 0.01. According to our procedure, when hypothetical and/or putative proteins fall into an annotated and validated cluster, they can safely inherit GO terms and Pfam domain/s even when their SI with the most annotated proteins is very low. These sequences can therefore be labelled as distantly related homologues and inherit function and structure (when available) in a validated manner. We previously discussed that this procedure can increase the level of annotation of UniProtKB (
<xref ref-type="bibr" rid="B11">11</xref>
). Here we increase the level of structural and functional annotations of cluster-included sequences by 54% (
<xref ref-type="fig" rid="F2">Figure 2</xref>
A). Sequences that stand alone (according to our criteria) are singletons. They can nevertheless carry information (
<xref ref-type="fig" rid="F2">Figure 2</xref>
B), provided that each singleton is endowed with PDB and/or Pfam and/or GO annotation.
<fig id="F1" position="float">
<label>Figure 1.</label>
<caption>
<p>BAR
<sup>+</sup>
implementation. Our method collects sequences from the protein universe (UniProtKB), also including 988 genomes. All the associated features are thereby included: PDB structures (± SCOP classification; red circles), GO terms (Molecular Function, Biological Process and Cellular Localization) and Pfam models (blue circles). An extensive BLAST alignment of all the 13 495 736 sequences is performed in a GRID environment. The sequence similarity network is built by connecting two sequences only if their SI is ≥40% with an overlapping COV ≥ 90%. About 913 762 clusters are obtained by splitting the network into its connected components. Any cluster may contain from 2 up to 87 893 sequences (the largest cluster contains ABC transporters from Prokaryotes, Eukaryotes and Archaea). Stand-alone sequences are called Singletons (30.4% of the total protein universe). Sequences inherit the annotations within a cluster. When clusters are endowed with PDB template/s, a Cluster-HMM is generated by considering all the sequences that have an identity ≥ 40% and a COV ≥ 90% with the structure/s (pink subset). The Cluster-HMM can be used to align all the other sequences in the cluster to the template/s.</p>
</caption>
<graphic xlink:href="gkr292f1"></graphic>
</fig>
<fig id="F2" position="float">
<label>Figure 2.</label>
<caption>
<p>Different types of annotations are possible with BAR
<sup>+</sup>
. After clustering and depending on the features (structure, domains and function) annotated in the cluster, sequences within a cluster can inherit different types of annotation. The percentage of sequences endowed with a given annotation type and inheriting validated annotation (
<italic>P</italic>
 < 0.01) is indicated. (
<bold>A</bold>
) Sequences within clusters. Percentage is computed with respect to the 9 401 223 sequences comprised in 913 762 clusters. Inherited: sequences that inherit annotations by falling into a cluster. Without validated annotation: the slice comprises sequences with no annotation or with non-validated annotations. (
<bold>B</bold>
) Singletons (stand alone sequences). Percentage is computed with respect to 4 091 908 singleton sequences.</p>
</caption>
<graphic xlink:href="gkr292f2"></graphic>
</fig>
</p>
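<p>The clustering rule described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the Hit record, its field names and the toy values are hypothetical, and the length of the intersection of the aligned regions is assumed to be already available for each pair (how it is extracted from the BLAST output is not specified here). COV is computed as that intersection length divided by the sum of the two sequence lengths minus the intersection length; two sequences are linked when SI ≥ 40% and COV ≥ 90%; clusters are the connected components, found here with a union-find structure (an implementation choice of this sketch).</p>
<preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment (hypothetical record, not a BLAST output format)."""
    a: str        # identifier of the first sequence
    b: str        # identifier of the second sequence
    si: float     # sequence identity, in percent
    len_a: int    # full length of sequence a
    len_b: int    # full length of sequence b
    overlap: int  # length of the intersection of the aligned regions (assumed given)


def coverage(hit: Hit) -> float:
    """COV in percent: overlap / (len_a + len_b - overlap)."""
    return 100.0 * hit.overlap / (hit.len_a + hit.len_b - hit.overlap)


def cluster(ids, hits, min_si=40.0, min_cov=90.0):
    """Return the connected components of the similarity graph as sorted lists."""
    parent = {i: i for i in ids}

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for h in hits:
        # Link two nodes only when both constraints hold.
        if h.si >= min_si and coverage(h) >= min_cov:
            parent[find(h.a)] = find(h.b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Toy input (made-up values, for illustration only).
ids = ["P1", "P2", "P3", "P4"]
hits = [
    Hit("P1", "P2", si=62.0, len_a=300, len_b=305, overlap=298),  # linked
    Hit("P2", "P3", si=45.0, len_a=305, len_b=310, overlap=300),  # linked
    Hit("P3", "P4", si=80.0, len_a=310, len_b=900, overlap=305),  # COV too low
]
clusters = cluster(ids, hits)
print(clusters)
```
]]></preformat>
<p>In this toy input P4 is left as a singleton: its alignment with P3 is 80% identical, but it covers only about a third of the combined length, so the COV constraint fails.</p>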
</sec>
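<sec>
<p>The significance filter described in the BAR+ IMPLEMENTATION section can be illustrated with a minimal Python sketch. It assumes that a raw P-value has already been computed for every GO term or Pfam domain found in a cluster (the statistical test that produces these values is not restated here) and that the number of tests equals the number of terms in the cluster; the function names, term labels and numbers are hypothetical. Each raw P-value is multiplied by the number of tests (Bonferroni correction, capped at 1), and a term is kept as validated when the corrected value is ≤ 0.01.</p>
<preformat><![CDATA[
```python
def bonferroni(p_values):
    """Multiply each raw P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {term: min(1.0, p * m) for term, p in p_values.items()}


def validated_terms(p_values, threshold=0.01):
    """Return the terms whose corrected P-value is at or below the threshold."""
    corrected = bonferroni(p_values)
    return sorted(term for term, p in corrected.items() if p <= threshold)


# Made-up raw P-values for three terms of one cluster (illustration only).
raw = {"GO_term_A": 1e-6, "Pfam_domain_B": 2e-3, "GO_term_C": 0.2}
corrected = bonferroni(raw)
validated = validated_terms(raw)
print(corrected)
print(validated)
```
]]></preformat>
<p>With three terms tested, the raw value 0.002 becomes 0.006 and still passes the 0.01 threshold, whereas 0.2 becomes 0.6 and is treated as not significant for the cluster.</p>
</sec>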
<sec>
<title>CLUSTER-HMMs</title>
<p>In BAR
<sup>+</sup>
, when PDB templates are present within a cluster (with or without their SCOP classification), profile HMMs are computed on the basis of sequence to structure alignment and are cluster associated (Cluster-HMM) (
<xref ref-type="fig" rid="F1">Figure 1</xref>
). When different templates are present in a cluster the structural alignment among them is computed with MUSTANG (
<xref ref-type="bibr" rid="B13">13</xref>
). Multiple alignments comprising all the overlapping templates and the sequences similar to them (with SI ≥ 40% and COV ≥ 90%) are computed with MUSCLE (
<xref ref-type="bibr" rid="B14">14</xref>
) and fed to HMMER 2.3 (
<xref ref-type="bibr" rid="B15">15</xref>
) in order to train the profile HMM. In this way, a library of 10 858 HMMs is made available for aligning even distantly related sequences to given PDB template/s. The server also provides the pairwise query sequence–structural target alignment, computed from the corresponding Cluster-HMM with the Viterbi decoding implemented in HMMER; this alignment is useful for further processing and/or for computing the corresponding 3D structure.
</p>
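<p>As an illustration of which sequences enter the Cluster-HMM training alignment, the selection rule stated above (SI ≥ 40% and COV ≥ 90% with respect to a template) might be written as below. The function name and the dictionary layout are hypothetical; the sketch covers only the selection, not the MUSTANG, MUSCLE or HMMER steps.</p>

```python
def select_hmm_training_set(template_ids, pairwise, si_min=40.0, cov_min=90.0):
    """Return the templates plus the cluster members similar enough to them.

    pairwise maps (sequence_id, template_id) to (identity, coverage) in
    percent; this layout is a hypothetical stand-in for stored BLAST results.
    """
    templates = set(template_ids)
    selected = set(templates)  # the templates themselves are always included
    for (seq_id, template_id), (identity, coverage) in pairwise.items():
        is_template_hit = template_id in templates
        if is_template_hit and identity >= si_min and coverage >= cov_min:
            selected.add(seq_id)
    return sorted(selected)
```

<p>The returned identifiers would then be the input of the multiple alignment from which the profile HMM is trained.</p>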
</sec>
<sec>
<title>DIFFERENT ANNOTATIONS WITH BAR
<sup>+</sup>
</title>
<p>BAR
<sup>+</sup>
allows 35 possible fine-grained types of annotation (plus no annotation) (
<xref ref-type="table" rid="T1">Table 1</xref>
). The most complete type of annotation is the one with PDB (with and without SCOP annotation) and GO terms and Pfam domains with
<italic>P</italic>
 ≤ 0.01 (validated) (first row in
<xref ref-type="table" rid="T1">Table 1</xref>
). Interestingly, 0.11% of the total sequences in our database are sufficient to annotate, in a validated manner and with the most complete annotation, another 21.99% of the sequences sharing common clusters (8251 clusters; 0.90% of the total), with an annotation gain factor higher than 200. Summing up (along the first row of
<xref ref-type="table" rid="T1">Table 1</xref>
), we can conclude that validated functional annotation is possible within 10% of the clusters. Eleven percent of the sequences remain without annotation and are included in 45% of the clusters. About 57% of singletons (corresponding to 17% of the total set) are annotated with different features (
<xref ref-type="fig" rid="F2">Figure 2</xref>
B and
<xref ref-type="table" rid="T1">Table 1</xref>
).
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<p>The fine-grained types of annotation with BAR
<sup>+</sup>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">PDB (%)</th>
<th rowspan="1" colspan="1">SCOP Mono</th>
<th rowspan="1" colspan="1">SCOP Multi</th>
<th rowspan="1" colspan="1">Without PDB</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">GO validated</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Pfam validated</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">8251 (0.90)</td>
<td rowspan="1" colspan="1">3613 (0.40)</td>
<td rowspan="1" colspan="1">1461 (0.16)</td>
<td rowspan="1" colspan="1">83 266 (9.11)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">2 982 449 (22.10)</td>
<td rowspan="1" colspan="1">1 408 542 (10.44)</td>
<td rowspan="1" colspan="1">1 028 565 (7.62)</td>
<td rowspan="1" colspan="1">2 903 431 (21.51)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>2 967 743 (21.99)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 404 011 (10.40)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 026 154 (7.60)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 382 310 (10.24)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Pfam</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">8334 (0.91)</td>
<td rowspan="1" colspan="1">3647 (0.40)</td>
<td rowspan="1" colspan="1">1463 (0.16)</td>
<td rowspan="1" colspan="1">85 886 (9.40)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">2 984 057 (22.11)</td>
<td rowspan="1" colspan="1">1 409 647 (10.45)</td>
<td rowspan="1" colspan="1">1 028 569 (7.62)</td>
<td rowspan="1" colspan="1">2 922 876 (21.66)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>2 969 285 (22.00)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 405 095 (10.41)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 026 156 (7.60)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 398 603 (10.36)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Without Pfam</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">320 (0.04)</td>
<td rowspan="1" colspan="1">123 (0.01)</td>
<td rowspan="1" colspan="1">25
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">6251 (0.68)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">42 202 (0.31)</td>
<td rowspan="1" colspan="1">15 415 (0.11)</td>
<td rowspan="1" colspan="1">7363 (0.05)</td>
<td rowspan="1" colspan="1">143 533 (1.06)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>41 825 (0.31)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>15 303 (0.11)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>7331 (0.05)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>93 568 (0.69)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">GO</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Pfam validated</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">8938 (0.98)</td>
<td rowspan="1" colspan="1">3887 (0.43)</td>
<td rowspan="1" colspan="1">1504 (0.16)</td>
<td rowspan="1" colspan="1">133 895 (14.65)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">3 042 649 (22.55)</td>
<td rowspan="1" colspan="1">1 450 437 (10.75)</td>
<td rowspan="1" colspan="1">1 029 707 (7.63)</td>
<td rowspan="1" colspan="1">3 311 421 (24.54)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>3 026 916 (22.43)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 445 521 (10.71)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 027 219 (7.61)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 617 763 (11.99)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Pfam</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">9357 (1.02)</td>
<td rowspan="1" colspan="1">4033 (0.44)</td>
<td rowspan="1" colspan="1">1526 (0.17)</td>
<td rowspan="1" colspan="1">322 937 (35.34)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">3 045 465 (22.57)</td>
<td rowspan="1" colspan="1">1 451 928 (10.76)</td>
<td rowspan="1" colspan="1">1 029 755 (7.63)</td>
<td rowspan="1" colspan="1">3 739 076 (27.71)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>3 029 337 (22.45)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 446 890 (10.72)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 027 247 (7.61)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1 852 223 (13.72)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Singletons</td>
<td rowspan="1" colspan="1">2608 (0.02)</td>
<td rowspan="1" colspan="1">10
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">5
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">1 515 720 (11.23)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Without Pfam</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">452 (0.05)</td>
<td rowspan="1" colspan="1">176 (0.02)</td>
<td rowspan="1" colspan="1">30
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">45 539 (4.98)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">46 311 (0.34)</td>
<td rowspan="1" colspan="1">17 020 (0.13)</td>
<td rowspan="1" colspan="1">7400 (0.05)</td>
<td rowspan="1" colspan="1">330 354 (2.45)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>45 803 (0.34)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>16 862 (0.12)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>7362 (0.05)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>226 500 (1.68)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Singletons</td>
<td rowspan="1" colspan="1">279
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">2
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">2
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">129 212 (0.96)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Without GO</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Pfam validated</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">679 (0.07)</td>
<td rowspan="1" colspan="1">345 (0.04)</td>
<td rowspan="1" colspan="1">15
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">54 314 (5.94)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">44 172 (0.33)</td>
<td rowspan="1" colspan="1">27 775 (0.21)</td>
<td rowspan="1" colspan="1">654
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">547 459 (4.06)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>43 416 (0.32)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>27 410 (0.20)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>633</bold>
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">
<bold>221 585 (1.64)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Pfam</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">779 (0.09)</td>
<td rowspan="1" colspan="1">377 (0.04)</td>
<td rowspan="1" colspan="1">16
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">122 236 (13.38)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">44 582 (0.33)</td>
<td rowspan="1" colspan="1">27 983 (0.21)</td>
<td rowspan="1" colspan="1">656
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">695 684 (5.15)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>43 735 (0.32)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>27 592 (0.20)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>634</bold>
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">
<bold>301 792 (2.24)</bold>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Singletons</td>
<td rowspan="1" colspan="1">205
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">
<bold>1</bold>
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">
<bold>0</bold>
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">702 834 (5.21)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">    Without Pfam</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Clusters</td>
<td rowspan="1" colspan="1">270 (0.03)</td>
<td rowspan="1" colspan="1">83 (0.01)</td>
<td rowspan="1" colspan="1">5
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">412 192 (45.11)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Sequences</td>
<td rowspan="1" colspan="1">5308 (0.04)</td>
<td rowspan="1" colspan="1">1771 (0.01)</td>
<td rowspan="1" colspan="1">154
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">1 494 443 (11.07)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">        
<bold>Inherited</bold>
</td>
<td rowspan="1" colspan="1">
<bold>5023 (0.04)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>1689 (0.01)</bold>
</td>
<td rowspan="1" colspan="1">
<bold>149</bold>
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1">        Singletons</td>
<td rowspan="1" colspan="1">129
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">1
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">0
<xref ref-type="table-fn" rid="TF2">
<sup>a</sup>
</xref>
</td>
<td rowspan="1" colspan="1">1 743 526 (12.92)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TF1">
<p>Percentages are evaluated with respect to the total number of sequences in the database (13 495 736 sequences). Bold characters: sequences that inherit the annotation type.</p>
</fn>
<fn id="TF2">
<p>
<sup>a</sup>
Values are negligible. Validated:
<italic>P</italic>
 ≤ 0.01 (See text for details, 11). Within BAR
<sup>+</sup>
clusters, 35 different types of annotation are possible: (i) +GO+Pfam+PDB [with or without SCOP (Monodomain, Multidomain)*]; GO and Pfam are or are not validated (no. of levels = 12). (ii) +Pfam+PDB (with or without SCOP)* (no. of levels = 6). (iii) +GO+PDB (with or without SCOP)* (no. of levels = 6). (iv) +Pfam+GO (no. of levels = 4). (v) +PDB (with or without SCOP)* (no. of levels = 3). (vi) +GO (no. of levels = 2). (vii) +Pfam (no. of levels = 2). Seventy percent of the initial set falls into clusters (913 762) and 53% into validated clusters. Some 6% of the sequences are annotated without validation and the remaining 11% are not annotated (rightmost bottom cell). About 17 and 13% of the sequences are singletons with and without annotations, respectively.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
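<p>The figures quoted in this section can be checked against the first data row of Table 1 with a few lines of arithmetic. The sketch assumes that the sequences acting as annotation sources are those counted under "Sequences" but not under "Inherited"; this reading is ours and is not stated explicitly in the text.</p>

```python
TOTAL_SEQUENCES = 13_495_736  # database size given in the table footnote

# First data row of Table 1, PDB column (GO validated, Pfam validated).
sequences_in_clusters = 2_982_449
inherited = 2_967_743

# Assumption: sequences that are not "inherited" are the annotation sources.
sources = sequences_in_clusters - inherited

source_percent = 100 * sources / TOTAL_SEQUENCES
inherited_percent = 100 * inherited / TOTAL_SEQUENCES
gain_factor = inherited / sources

# Levels listed in the footnote for annotation types (i) to (vii).
levels = [12, 6, 6, 4, 3, 2, 2]

print(f"sources: {sources} ({source_percent:.2f}%)")
print(f"inherited: {inherited_percent:.2f}%")
print(f"gain factor: {gain_factor:.1f}")
print(f"annotation types: {sum(levels)}")
```

<p>Under this assumption the sources amount to 14 706 sequences (0.11% of the database), the gain factor is about 202, and the seven groups of levels add up to the 35 annotation types.</p>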
</sec>
<sec>
<title>SUBMITTING A PROTEIN SEQUENCE TO BAR
<sup>+</sup>
</title>
<p>When a query sequence is submitted, there are three possible outcomes (
<xref ref-type="fig" rid="F3">Figure 3</xref>
). The sequence can match a sequence already present in a cluster (or a singleton). In this way, non-annotated proteins can inherit functional and structural annotation from other proteins within the same cluster. Validated annotations are inherited when clusters are endowed with validated GO and Pfam (
<italic>P</italic>
 < 0.01). Alternatively a BLAST alignment starts. The query sequence may then align with any other sequence in BAR
<sup>+</sup>
under the stringent criteria of our procedure and, therefore, find a cluster from which it can safely inherit all the corresponding structural and functional features. Alternatively, when the criteria are not met, all the BLAST matches are returned. This still allows the sequence to be located within a cluster; in this case, however, annotation through inheritance should be manually curated. Singletons may or may not be a source of information, depending on their annotation.
<fig id="F3" position="float">
<label>Figure 3.</label>
<caption>
<p>BAR
<sup>+</sup>
at work. A query sequence has been submitted. Provided that, after running BLAST, the sequence has SI ≥ 40% with COV ≥ 90% to any sequence of BAR
<sup>+</sup>
, it is included in a cluster. In the above example, the cluster is well annotated and the sequence inherits all the possible annotations from the cluster, including GO terms (203), PDB/s, ligands, SCOP and Pfam annotations and the Cluster-HMM. Furthermore, the alignment/s of the query sequence to the cluster template/s, computed with the Cluster-HMM, is/are also provided in PIR format. All the sequences that align with the query are returned. (•••) Only the top and bottom portions of the page are shown.</p>
</caption>
<graphic xlink:href="gkr292f3"></graphic>
</fig>
</p>
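<p>The three outcomes described above can be summarised as a small decision routine. This is a sketch under stated assumptions, not the server code: the function name, the data layouts and the choice of the highest-identity hit among those passing the thresholds are all hypothetical.</p>

```python
def annotate_query(query, known_sequences, blast_hits, si_min=40.0, cov_min=90.0):
    """Return (outcome, payload) for a submitted sequence.

    known_sequences maps a sequence string to its cluster (or singleton) id.
    blast_hits is a list of (cluster_id, identity, coverage) tuples.
    Both layouts are hypothetical stand-ins for the server's data.
    """
    # Outcome 1: the query is already stored in the database.
    if query in known_sequences:
        return "exact_match", known_sequences[query]

    # Outcome 2: at least one BLAST hit satisfies both thresholds.
    passing = [hit for hit in blast_hits
               if hit[1] >= si_min and hit[2] >= cov_min]
    if passing:
        best = max(passing, key=lambda hit: (hit[1], hit[2]))
        return "cluster_assigned", best[0]

    # Outcome 3: no hit passes; return every match for manual curation.
    return "manual_curation", list(blast_hits)
```

<p>Only the first two outcomes allow annotation to be inherited automatically; the third returns the raw matches so that a curator can decide.</p>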
</sec>
<sec>
<title>BAR
<sup>+</sup>
UPDATE</title>
<p>BAR
<sup>+</sup>
collects sequences and their features from UniProtKB and genome repositories. Re-clustering is scheduled on a yearly basis. BAR
<sup>+</sup>
cluster annotation will be updated every 6 months. This is based on the notion that the BAR
<sup>+</sup>
annotation system increases its capacity only when we add information. This is achieved when proteins with evidence at the transcript and protein level (e.g. new PDB files and/or proteins with GO/Pfam terms) are included in the system. For example, by comparing UniProtKB 05_2010 with SwissProt 03_2011, we collected some 2445 sequences carrying information according to our criteria (evidence at protein/transcript level). By aligning this set against BAR
<sup>+</sup>
clusters, we find that 62% of the sequences fall into already validated clusters. About 8% align with singletons, and only 0.03% of the total number of BAR
<sup>+</sup>
singletons become new clusters (with two protein sequences). Another 7% fall into non-validated clusters without affecting the statistical significance of the cluster-specific annotation. The remaining 23% originate new singletons. We are currently planning to include other annotation resources in order to extend our annotation process with more protein domains and their interactions.</p>
</sec>
<sec>
<title>FUNDING</title>
<p>D.P. is the recipient of a MIUR (Ministero Istruzione Università Ricerca) fellowship supporting his Ph.D. program; MIUR-FIRB (Fondo per gli Investimenti della Ricerca di Base) 2003/LIBI-International Laboratory for Bioinformatics delivered (to R.C., in part). Funding for open access charge:
<funding-source>Fondo Ordinario per le Università (FFO)</funding-source>
2010 delivered (to R.C. and P.L.M.).</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>The authors would like to thank INFN (Istituto Nazionale di Fisica Nucleare) and CNAF (Centro Nazionale per la Ricerca e Sviluppo delle Tecnologie Informatiche e Telematiche) for support in GRID computing.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krause</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Stoye</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vingron</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>The SYSTERS protein sequence cluster set</article-title>
<source>Nucleic Acids Res.</source>
<year>2000</year>
<volume>28</volume>
<fpage>270</fpage>
<lpage>272</lpage>
<pub-id pub-id-type="pmid">10592244</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heger</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Holm</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Picasso: generating a covering set of protein family profiles</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>272</fpage>
<lpage>279</lpage>
<pub-id pub-id-type="pmid">11294792</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>CH</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Nikolskaya</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Barker</surname>
<given-names>WC</given-names>
</name>
</person-group>
<article-title>The iProClass integrated data base for protein functional analysis</article-title>
<source>Nucleic Acids Res.</source>
<year>2001</year>
<volume>29</volume>
<fpage>52</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="pmid">11125047</pub-id>
</element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kriventseva</surname>
<given-names>EV</given-names>
</name>
<name>
<surname>Fleischmann</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Zdobnov</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Apweiler</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>CluSTr: a data base of clusters of SWISS-PROT+TrEMBL proteins</article-title>
<source>Nucleic Acids Res.</source>
<year>2001</year>
<volume>29</volume>
<fpage>33</fpage>
<lpage>36</lpage>
<pub-id pub-id-type="pmid">11125042</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Petryszak</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kretschmann</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Wieser</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Apweiler</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>The predictive power of the CluSTr data base</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>3604</fpage>
<lpage>3609</lpage>
<pub-id pub-id-type="pmid">15961444</pub-id>
</element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kaplan</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sasson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Inbar</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Friedlich</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fromer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fleischer</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Portugaly</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>ProtoNet 4.0: a hierarchical classification of one million protein sequences</article-title>
<source>Nucleic Acids Res.</source>
<year>2005</year>
<volume>33</volume>
<fpage>D216</fpage>
<lpage>D218</lpage>
<pub-id pub-id-type="pmid">15608180</pub-id>
</element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Loewenstein</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Portugaly</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fromer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Linial</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Efficient algorithms for accurate hierarchical clustering of huge data sets: tackling the entire protein space</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>i41</fpage>
<lpage>i49</lpage>
<pub-id pub-id-type="pmid">18586742</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sperisen</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pagni</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture</article-title>
<source>BMC Bioinformatics</source>
<year>2005</year>
<volume>6</volume>
<fpage>216</fpage>
<lpage>227</lpage>
<pub-id pub-id-type="pmid">16135248</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Enright</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Van Dongen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>CA</given-names>
</name>
</person-group>
<article-title>An efficient algorithm for large-scale detection of protein families</article-title>
<source>Nucleic Acids Res.</source>
<year>2002</year>
<volume>30</volume>
<fpage>1575</fpage>
<lpage>1584</lpage>
<pub-id pub-id-type="pmid">11917018</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cuff</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Sillitoe</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Clegg</surname>
<given-names>AB</given-names>
</name>
<name>
<surname>Rentzsch</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Furnham</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Pellegrini-Calace</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Thornton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Orengo</surname>
<given-names>CA</given-names>
</name>
</person-group>
<article-title>Extending CATH: increasing coverage of the protein structure universe and linking structure with function</article-title>
<source>Nucleic Acids Res.</source>
<year>2011</year>
<volume>39</volume>
<fpage>D420</fpage>
<lpage>D426</lpage>
<pub-id pub-id-type="pmid">21097779</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bartoli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Montanucci</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Fronza</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Martelli</surname>
<given-names>PL</given-names>
</name>
<name>
<surname>Fariselli</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Carota</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Donvito</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Maggi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Casadio</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>The Bologna Annotation Resource: a non-hierarchical method for the functional and structural annotation of protein sequences relying on a comparative large-scale genome analysis</article-title>
<source>J. Proteome Res.</source>
<year>2009</year>
<volume>8</volume>
<fpage>4362</fpage>
<lpage>4371</lpage>
<pub-id pub-id-type="pmid">19552451</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>McGinnis</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
</person-group>
<article-title>BLAST: at the core of a powerful and diverse set of sequence analysis tools</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<issue>Web Server issue</issue>
<fpage>W20</fpage>
<lpage>W25</lpage>
<pub-id pub-id-type="pmid">15215342</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Konagurthu</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Whisstock</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Stuckey</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Lesk</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>MUSTANG: a multiple structural alignment algorithm</article-title>
<source>Proteins: Structure, Function, and Bioinformatics</source>
<year>2006</year>
<volume>64</volume>
<fpage>559</fpage>
<lpage>574</lpage>
</element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edgar</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>MUSCLE: multiple sequence alignment with high accuracy and high throughput</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>1792</fpage>
<lpage>1797</lpage>
<pub-id pub-id-type="pmid">15034147</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eddy</surname>
<given-names>SR</given-names>
</name>
</person-group>
<article-title>Profile hidden Markov models</article-title>
<source>Bioinformatics</source>
<year>1998</year>
<volume>14</volume>
<fpage>755</fpage>
<lpage>763</lpage>
<pub-id pub-id-type="pmid">9918945</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
