MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Functional Representation of Enzymes by Specific Peptides

Internal identifier: 000F82 (Pmc/Corpus); previous: 000F81; next: 000F83

Authors: Vered Kunik; Yasmine Meroz; Zach Solan; Ben Sandbank; Uri Weingart; Eytan Ruppin; David Horn

Source:

RBID : PMC:1950953

Abstract

Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.


Url:
DOI: 10.1371/journal.pcbi.0030167
PubMed: 17722976
PubMed Central: 1950953

Links to Exploration step

PMC:1950953

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Functional Representation of Enzymes by Specific Peptides</title>
<author>
<name sortKey="Kunik, Vered" sort="Kunik, Vered" uniqKey="Kunik V" first="Vered" last="Kunik">Vered Kunik</name>
<affiliation>
<nlm:aff id="aff1"> School of Computer Science, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Meroz, Yasmine" sort="Meroz, Yasmine" uniqKey="Meroz Y" first="Yasmine" last="Meroz">Yasmine Meroz</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Solan, Zach" sort="Solan, Zach" uniqKey="Solan Z" first="Zach" last="Solan">Zach Solan</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sandbank, Ben" sort="Sandbank, Ben" uniqKey="Sandbank B" first="Ben" last="Sandbank">Ben Sandbank</name>
<affiliation>
<nlm:aff id="aff1"> School of Computer Science, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Weingart, Uri" sort="Weingart, Uri" uniqKey="Weingart U" first="Uri" last="Weingart">Uri Weingart</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ruppin, Eytan" sort="Ruppin, Eytan" uniqKey="Ruppin E" first="Eytan" last="Ruppin">Eytan Ruppin</name>
<affiliation>
<nlm:aff id="aff1"> School of Computer Science, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3"> Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Horn, David" sort="Horn, David" uniqKey="Horn D" first="David" last="Horn">David Horn</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">17722976</idno>
<idno type="pmc">1950953</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950953</idno>
<idno type="RBID">PMC:1950953</idno>
<idno type="doi">10.1371/journal.pcbi.0030167</idno>
<date when="2007">2007</date>
<idno type="wicri:Area/Pmc/Corpus">000F82</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F82</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Functional Representation of Enzymes by Specific Peptides</title>
<author>
<name sortKey="Kunik, Vered" sort="Kunik, Vered" uniqKey="Kunik V" first="Vered" last="Kunik">Vered Kunik</name>
<affiliation>
<nlm:aff id="aff1"> School of Computer Science, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Meroz, Yasmine" sort="Meroz, Yasmine" uniqKey="Meroz Y" first="Yasmine" last="Meroz">Yasmine Meroz</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Solan, Zach" sort="Solan, Zach" uniqKey="Solan Z" first="Zach" last="Solan">Zach Solan</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sandbank, Ben" sort="Sandbank, Ben" uniqKey="Sandbank B" first="Ben" last="Sandbank">Ben Sandbank</name>
<affiliation>
<nlm:aff id="aff1"> School of Computer Science, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Weingart, Uri" sort="Weingart, Uri" uniqKey="Weingart U" first="Uri" last="Weingart">Uri Weingart</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ruppin, Eytan" sort="Ruppin, Eytan" uniqKey="Ruppin E" first="Eytan" last="Ruppin">Eytan Ruppin</name>
<affiliation>
<nlm:aff id="aff1"> School of Computer Science, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3"> Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Horn, David" sort="Horn, David" uniqKey="Horn D" first="David" last="Horn">David Horn</name>
<affiliation>
<nlm:aff id="aff2"> School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint>
<date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Domingues, Fs" uniqKey="Domingues F">FS Domingues</name>
</author>
<author>
<name sortKey="Lengauer, T" uniqKey="Lengauer T">T Lengauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rost, B" uniqKey="Rost B">B Rost</name>
</author>
<author>
<name sortKey="Yachdav, G" uniqKey="Yachdav G">G Yachdav</name>
</author>
<author>
<name sortKey="Liu, J" uniqKey="Liu J">J Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tian, W" uniqKey="Tian W">W Tian</name>
</author>
<author>
<name sortKey="Skolnick, J" uniqKey="Skolnick J">J Skolnick</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hegyi, H" uniqKey="Hegyi H">H Hegyi</name>
</author>
<author>
<name sortKey="Gerstein, M" uniqKey="Gerstein M">M Gerstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rost, B" uniqKey="Rost B">B Rost</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Von Grotthuss, M" uniqKey="Von Grotthuss M">M von Grotthuss</name>
</author>
<author>
<name sortKey="Plewczynski, D" uniqKey="Plewczynski D">D Plewczynski</name>
</author>
<author>
<name sortKey="Ginalsky, K" uniqKey="Ginalsky K">K Ginalsky</name>
</author>
<author>
<name sortKey="Rychlewski, L" uniqKey="Rychlewski L">L Rychlewski</name>
</author>
<author>
<name sortKey="Shakhnovich, Ei" uniqKey="Shakhnovich E">EI Shakhnovich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
<author>
<name sortKey="Koonin, Ev" uniqKey="Koonin E">EV Koonin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bairoch, A" uniqKey="Bairoch A">A Bairoch</name>
</author>
<author>
<name sortKey="Bucher, P" uniqKey="Bucher P">P Bucher</name>
</author>
<author>
<name sortKey="Hofmann, K" uniqKey="Hofmann K">K Hofmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Aitken, A" uniqKey="Aitken A">A Aitken</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Neville Manning, Cg" uniqKey="Neville Manning C">CG Neville-Manning</name>
</author>
<author>
<name sortKey="Wu, Td" uniqKey="Wu T">TD Wu</name>
</author>
<author>
<name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, Jy" uniqKey="Huang J">JY Huang</name>
</author>
<author>
<name sortKey="Brutlag, Dl" uniqKey="Brutlag D">DL Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Falquet, L" uniqKey="Falquet L">L Falquet</name>
</author>
<author>
<name sortKey="Pagni, M" uniqKey="Pagni M">M Pagni</name>
</author>
<author>
<name sortKey="Bucher, P" uniqKey="Bucher P">P Bucher</name>
</author>
<author>
<name sortKey="Hulo, N" uniqKey="Hulo N">N Hulo</name>
</author>
<author>
<name sortKey="Sigrist, Cj" uniqKey="Sigrist C">CJ Sigrist</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tong, Ah" uniqKey="Tong A">AH Tong</name>
</author>
<author>
<name sortKey="Drees, B" uniqKey="Drees B">B Drees</name>
</author>
<author>
<name sortKey="Nardelli, G" uniqKey="Nardelli G">G Nardelli</name>
</author>
<author>
<name sortKey="Bader, Gd" uniqKey="Bader G">GD Bader</name>
</author>
<author>
<name sortKey="Brannetti, B" uniqKey="Brannetti B">B Brannetti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Obenauer, Jc" uniqKey="Obenauer J">JC Obenauer</name>
</author>
<author>
<name sortKey="Yaffe, Mb" uniqKey="Yaffe M">MB Yaffe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Solan, Z" uniqKey="Solan Z">Z Solan</name>
</author>
<author>
<name sortKey="Horn, D" uniqKey="Horn D">D Horn</name>
</author>
<author>
<name sortKey="Ruppin, E" uniqKey="Ruppin E">E Ruppin</name>
</author>
<author>
<name sortKey="Edelman, S" uniqKey="Edelman S">S Edelman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ben Hur, A" uniqKey="Ben Hur A">A Ben-Hur</name>
</author>
<author>
<name sortKey="Brutlag, D" uniqKey="Brutlag D">D Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liao, L" uniqKey="Liao L">L Liao</name>
</author>
<author>
<name sortKey="Noble, Ws" uniqKey="Noble W">WS Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cai, Cz" uniqKey="Cai C">CZ Cai</name>
</author>
<author>
<name sortKey="Han, Ly" uniqKey="Han L">LY Han</name>
</author>
<author>
<name sortKey="Ji, Zl" uniqKey="Ji Z">ZL Ji</name>
</author>
<author>
<name sortKey="Chen, Yz" uniqKey="Chen Y">YZ Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cai, Cz" uniqKey="Cai C">CZ Cai</name>
</author>
<author>
<name sortKey="Han, Ly" uniqKey="Han L">LY Han</name>
</author>
<author>
<name sortKey="Ji, Zl" uniqKey="Ji Z">ZL Ji</name>
</author>
<author>
<name sortKey="Chen, Yz" uniqKey="Chen Y">YZ Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Schaffer, Aa" uniqKey="Schaffer A">AA Schaffer</name>
</author>
<author>
<name sortKey="Zhan, Jz" uniqKey="Zhan J">JZ Zhan</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ben Hur, A" uniqKey="Ben Hur A">A Ben-Hur</name>
</author>
<author>
<name sortKey="Brutlag, D" uniqKey="Brutlag D">D Brutlag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Foster, Pg" uniqKey="Foster P">PG Foster</name>
</author>
<author>
<name sortKey="Huang, L" uniqKey="Huang L">L Huang</name>
</author>
<author>
<name sortKey="Santi, Dv" uniqKey="Santi D">DV Santi</name>
</author>
<author>
<name sortKey="Stroud, Rm" uniqKey="Stroud R">RM Stroud</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Anda, P" uniqKey="Anda P">P Anda</name>
</author>
<author>
<name sortKey="Gebbia, Ja" uniqKey="Gebbia J">JA Gebbia</name>
</author>
<author>
<name sortKey="Backenson, Pb" uniqKey="Backenson P">PB Backenson</name>
</author>
<author>
<name sortKey="Coleman, Jl" uniqKey="Coleman J">JL Coleman</name>
</author>
<author>
<name sortKey="Benach, Jl" uniqKey="Benach J">JL Benach</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hanks, Sk" uniqKey="Hanks S">SK Hanks</name>
</author>
<author>
<name sortKey="Quinn, Am" uniqKey="Quinn A">AM Quinn</name>
</author>
<author>
<name sortKey="Hunter, T" uniqKey="Hunter T">T Hunter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Walker, Je" uniqKey="Walker J">JE Walker</name>
</author>
<author>
<name sortKey="Saraste, M" uniqKey="Saraste M">M Saraste</name>
</author>
<author>
<name sortKey="Runswick, Mj" uniqKey="Runswick M">MJ Runswick</name>
</author>
<author>
<name sortKey="Gay, Nj" uniqKey="Gay N">NJ Gay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Binkowski, Ta" uniqKey="Binkowski T">TA Binkowski</name>
</author>
<author>
<name sortKey="Naghibzadeg, S" uniqKey="Naghibzadeg S">S Naghibzadeg</name>
</author>
<author>
<name sortKey="Liang, J" uniqKey="Liang J">J Liang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benjamini, Y" uniqKey="Benjamini Y">Y Benjamini</name>
</author>
<author>
<name sortKey="Hochberg, Y" uniqKey="Hochberg Y">Y Hochberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ogiwara, A" uniqKey="Ogiwara A">A Ogiwara</name>
</author>
<author>
<name sortKey="Uchiyama, I" uniqKey="Uchiyama I">I Uchiyama</name>
</author>
<author>
<name sortKey="Seto, Y" uniqKey="Seto Y">Y Seto</name>
</author>
<author>
<name sortKey="Kanehisa, M" uniqKey="Kanehisa M">M Kanehisa</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Jtl" uniqKey="Wang J">JTL Wang</name>
</author>
<author>
<name sortKey="Marr, Tg" uniqKey="Marr T">TG Marr</name>
</author>
<author>
<name sortKey="Shasha, D" uniqKey="Shasha D">D Shasha</name>
</author>
<author>
<name sortKey="Shapiro, Ba" uniqKey="Shapiro B">BA Shapiro</name>
</author>
<author>
<name sortKey="Chirn, Gw" uniqKey="Chirn G">GW Chirn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rigoutsos, I" uniqKey="Rigoutsos I">I Rigoutsos</name>
</author>
<author>
<name sortKey="Floratos, A" uniqKey="Floratos A">A Floratos</name>
</author>
<author>
<name sortKey="Ouzounis, C" uniqKey="Ouzounis C">C Ouzounis</name>
</author>
<author>
<name sortKey="Gao, Y" uniqKey="Gao Y">Y Gao</name>
</author>
<author>
<name sortKey="Parida, L" uniqKey="Parida L">L Parida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Martin, Dm" uniqKey="Martin D">DM Martin</name>
</author>
<author>
<name sortKey="Berriman, M" uniqKey="Berriman M">M Berriman</name>
</author>
<author>
<name sortKey="Barton, Gj" uniqKey="Barton G">GJ Barton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hawkins, T" uniqKey="Hawkins T">T Hawkins</name>
</author>
<author>
<name sortKey="Luban, S" uniqKey="Luban S">S Luban</name>
</author>
<author>
<name sortKey="Kihara, D" uniqKey="Kihara D">D Kihara</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, T" uniqKey="Smith T">T Smith</name>
</author>
<author>
<name sortKey="Waterman, M" uniqKey="Waterman M">M Waterman</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">pcbi</journal-id>
<journal-id journal-id-type="publisher-id">plcb</journal-id>
<journal-id journal-id-type="pmc">ploscomp</journal-id>
<journal-title-group>
<journal-title>PLoS Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1553-734X</issn>
<issn pub-type="epub">1553-7358</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">17722976</article-id>
<article-id pub-id-type="pmc">1950953</article-id>
<article-id pub-id-type="doi">10.1371/journal.pcbi.0030167</article-id>
<article-id pub-id-type="publisher-id">07-PLCB-RA-0027R3</article-id>
<article-id pub-id-type="sici">plcb-03-08-17</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline">
<subject>Computational Biology</subject>
</subj-group>
<subj-group subj-group-type="System Taxonomy">
<subject>None</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Functional Representation of Enzymes by Specific Peptides</article-title>
<alt-title alt-title-type="running-head">Specific Peptides</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Kunik</surname>
<given-names>Vered</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Meroz</surname>
<given-names>Yasmine</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Solan</surname>
<given-names>Zach</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sandbank</surname>
<given-names>Ben</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Weingart</surname>
<given-names>Uri</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ruppin</surname>
<given-names>Eytan</given-names>
</name>
<xref ref-type="aff" rid="aff1">1</xref>
<xref ref-type="aff" rid="aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Horn</surname>
<given-names>David</given-names>
</name>
<xref ref-type="aff" rid="aff2">2</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
School of Computer Science, Tel Aviv University, Tel Aviv, Israel</aff>
<aff id="aff2">
<label>2</label>
School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel</aff>
<aff id="aff3">
<label>3</label>
Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Ofran</surname>
<given-names>Yanay</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">Columbia University, United States of America</aff>
<author-notes>
<corresp id="cor1">* To whom correspondence should be addressed. E-mail:
<email>horn@tau.ac.il</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<month>8</month>
<year>2007</year>
</pub-date>
<pub-date pub-type="epub">
<day>24</day>
<month>8</month>
<year>2007</year>
</pub-date>
<pub-date pub-type="epreprint">
<day>11</day>
<month>7</month>
<year>2007</year>
</pub-date>
<volume>3</volume>
<issue>8</issue>
<elocation-id>e167</elocation-id>
<history>
<date date-type="received">
<day>17</day>
<month>1</month>
<year>2007</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>7</month>
<year>2007</year>
</date>
</history>
<permissions>
<copyright-statement> © 2007 Kunik et al.</copyright-statement>
<copyright-year>2007</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<pmc-comment>Functional representation of enzymes by specific peptides</pmc-comment>
<abstract>
<p>Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.</p>
</abstract>
<abstract abstract-type="summary">
<title>Author Summary</title>
<sec id="st1">
<title></title>
<p>Sequence motifs are known to provide information about functional properties of proteins. In the past, many approaches have looked for deterministic motifs in protein sequences by searching for functionally over-represented k-mers, with moderate levels of success. Here we revisit and renew the utility of deterministic motifs by searching for them in a partially unsupervised and context-dependent manner. Using a novel motif extraction algorithm, MEX, deterministic sequence motifs are extracted from Swiss-Prot data containing more than 50,000 enzymes. They are then filtered by the Enzyme Commission classification hierarchy to produce sets of specific peptides (SPs). The latter specify enzyme function for 93% of the data, comparing well with existing approaches for enzyme classification. Importantly, SPs are found to have biological significance. A majority of all known active and binding sites of enzymes are covered by SPs, and many SPs are found to lie within spatial pockets in the neighborhood of the active sites. Both these results have extremely high statistical significance. A user-friendly tool that displays the hits of SPs for any protein sequence that is presented as a query, together with the EC assignments due to these SPs, is available at
<ext-link ext-link-type="uri" xlink:href="http://adios.tau.ac.il/SPSearch">http://adios.tau.ac.il/SPSearch</ext-link>
.</p>
</sec>
</abstract>
<counts>
<page-count count="10"></page-count>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>citation</meta-name>
<meta-value>Kunik V, Meroz Y, Solan Z, Sandbank B, Weingart U, et al. (2007) Functional representation of enzymes by specific peptides. PLoS Comput Biol 3(8): e167. doi:
<ext-link ext-link-type="doi" xlink:href="10.1371/journal.pcbi.0030167">10.1371/journal.pcbi.0030167</ext-link>
</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>One of the major efforts of computational research in molecular biology is to predict the function and spatial structure of proteins from the protein sequence of amino acids [
<xref rid="pcbi-0030167-b001" ref-type="bibr">1</xref>
,
<xref rid="pcbi-0030167-b002" ref-type="bibr">2</xref>
]. Conventional approaches to function prediction rely on sequence [
<xref rid="pcbi-0030167-b003" ref-type="bibr">3</xref>
] or structure [
<xref rid="pcbi-0030167-b004" ref-type="bibr">4</xref>
] similarity with proteins whose functions are known. This is sometimes misleading [
<xref rid="pcbi-0030167-b004" ref-type="bibr">4</xref>
–
<xref rid="pcbi-0030167-b006" ref-type="bibr">6</xref>
]. Alternatively, one may use motif approaches [
<xref rid="pcbi-0030167-b007" ref-type="bibr">7</xref>
–
<xref rid="pcbi-0030167-b012" ref-type="bibr">12</xref>
], which try to extract from the data subsequences that are responsible for particular functions. Motifs can be deterministic sequences of amino acids, regular expressions that allow various alternatives at specific locations within the motif, or stochastic structures specifying the probability of each amino acid at every location. This work aims to uncover deterministic sequence motifs and considers their relationship with protein function. We focus on enzymes, whose functions are classified by the Enzyme Commission (EC) four-level hierarchy, which is represented by four integers, n1.n2.n3.n4, corresponding to the different levels of classification. For example, the oxidoreductase class, one of the six main divisions, corresponds to n1 = 1. For this class, n2 (subclass) specifies the electron donors, n3 (sub-subclass) specifies the electron acceptors, and n4 indicates the exact enzymatic activity.</p>
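<p>The four-integer EC notation described above can be illustrated with a minimal Python sketch. The names ECNumber, parse_ec, and prefix are hypothetical helpers introduced only for this illustration and are not part of the original work; the two EC numbers used, 5.1.3.2 and 5.1.3.20, are the level-4 groups shown in Figure 1B.</p>
<preformat><![CDATA[
```python
from typing import NamedTuple


class ECNumber(NamedTuple):
    """Four-level EC identifier n1.n2.n3.n4 (hypothetical helper type)."""
    n1: int
    n2: int
    n3: int
    n4: int

    def prefix(self, level: int) -> tuple:
        """Return the first `level` integers, i.e. the branch at that depth."""
        return tuple(self)[:level]


def parse_ec(text: str) -> ECNumber:
    """Parse a fully specified EC string such as '5.1.3.2'."""
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {text!r}")
    return ECNumber(*(int(p) for p in parts))


a = parse_ec("5.1.3.2")
b = parse_ec("5.1.3.20")
# Both numbers lie on the same third-level branch but differ at level 4.
print(a.prefix(3) == b.prefix(3))
print(a == b)
```
]]></preformat>
<p>Comparing prefixes of different lengths is one simple way to ask whether two enzymes share a branch at a given level of the hierarchy.</p>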
<p>Conventional sequence motif searches in enzymes are performed in a supervised fashion, using sequences of proteins that are known to have the same function and looking for (deterministic, regular-expression, or stochastic) motifs that are over-represented in this group of proteins. The motifs in question should then subserve such functions as [
<xref rid="pcbi-0030167-b009" ref-type="bibr">9</xref>
] phosphorylation of protein kinases; metal binding sites for calcium, zinc, copper, and iron; enzyme active sites, etc. With the advent of studies of protein–protein interactions, interest grew in finding sequence motifs that are responsible for them, and span an “interaction space” [
<xref rid="pcbi-0030167-b013" ref-type="bibr">13</xref>
,
<xref rid="pcbi-0030167-b014" ref-type="bibr">14</xref>
].</p>
<p>Here we perform a large-scale search for deterministic sequence motifs without specifying a priori their exact functional roles, using the unsupervised motif extraction (MEX) algorithm [
<xref rid="pcbi-0030167-b015" ref-type="bibr">15</xref>
]. We used a single piece of functional guidance: MEX was applied separately to each of the six major EC classes. The same motifs may also appear in other classes, yet many of them turn out to occur in only one class and to belong to a specific EC branch. The latter (see
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
A) are termed specific peptides (SPs). By representing some 50,000 enzymes (with an average length of 380 amino acids) in terms of about the same number of SPs (with an average length of 8.4 amino acids), we obtain a highly compressed functional representation and an EC classification with 93% accuracy.</p>
<fig id="pcbi-0030167-g001" position="float">
<label>Figure 1</label>
<caption>
<title>The Occurrence of Specific Peptides within the EC Hierarchy of Enzymes</title>
<p>(A) A sketch of the EC hierarchy and the assignments of SPs to SP classes. SPs can be compared with those appearing in
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B.</p>
<p>(B) Aligned sequences of two groups of enzymes of level 4 that share the same third-level assignment. Alignment is performed according to SPs. The organisms in the upper group, 5.1.3.20, belong to proteobacteria, while those of the lower group, 5.1.3.2, also contain eukaryotes (ARATH, CYATE, and PEA). Boldfaced substrings denote SPs. Amino acids flanked by spaces denote active sites and binding sites, as indicated above. A list of all SPs and their assignments to SPN classes is presented below the sequences.</p>
</caption>
<graphic xlink:href="pcbi.0030167.g001"></graphic>
</fig>
<p>This may be compared with other methods based on e-motifs [
<xref rid="pcbi-0030167-b016" ref-type="bibr">16</xref>
], sequence similarity [
<xref rid="pcbi-0030167-b017" ref-type="bibr">17</xref>
], or physicochemical properties of the amino acids contained in the sequence [
<xref rid="pcbi-0030167-b018" ref-type="bibr">18</xref>
,
<xref rid="pcbi-0030167-b019" ref-type="bibr">19</xref>
]. Our results compare favorably with such methods, as will be shown below, yet our approach differs in several respects: we use a largely unsupervised motif extraction method, we perform a comprehensive study of all enzymes, and we put major emphasis on the biological relevance of the SPs themselves.</p>
<p>Importantly, in comparison with the large-scale and popular motif database ProSite [
<xref rid="pcbi-0030167-b008" ref-type="bibr">8</xref>
], our approach displays a wide-margin advantage: ProSite motifs cover only 63% of all enzymes in the database.</p>
</sec>
<sec id="s2">
<title>Results</title>
<sec id="s2a">
<title>The Specific Peptides</title>
<p>SPs, as defined above, are MEX motifs that are specific to a single branch of the EC hierarchical classification. Most belong to single branches of the fourth level of the hierarchy, to be denoted as SPs of level 4 (SP4) (see
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
A). SPs of higher hierarchy, SP3, SP2, and SP1, appear in more than one lower EC level. Thus, if a peptide is shared by two or more level 4 groups that belong to the same third EC level, and appears nowhere else, it is assigned to SP3. The SPs were further screened to eliminate any peptide that includes within it another peptide carrying the same SPN (N = 1,2,3,4) label.</p>
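<p>The assignment rule described above can be illustrated with a minimal sketch. It assumes that the EC numbers of all enzymes carrying a given peptide are already known; the function name sp_level is hypothetical and is not part of MEX. The sketch omits the subsequent screening of peptides that contain another peptide with the same SPN label.</p>

```python
def sp_level(ec_numbers):
    """Return (level, branch) of the deepest EC branch shared by all enzymes carrying a peptide.

    Level 0 means the peptide occurs in more than one EC class and is therefore not an SP.
    """
    split = [ec.split(".") for ec in ec_numbers]
    prefix = []
    for digits in zip(*split):
        if len(set(digits)) != 1:
            break
        prefix.append(digits[0])
    return len(prefix), ".".join(prefix)


# A peptide shared by two level-4 groups of the same third level is an SP3.
print(sp_level(["5.1.3.2", "5.1.3.20"]))  # (3, '5.1.3')
print(sp_level(["5.1.3.20"]))             # (4, '5.1.3.20')
```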
<p>The fact that the majority of SPs are found at level 4 of the EC hierarchy (
<xref ref-type="table" rid="pcbi-0030167-t001">Table 1</xref>
) is probably due to the high homology within this level, which often includes many orthologous genes. Thousands of SPs occur at higher levels of the hierarchy, reflecting functional similarity among enzymes with lower sequence similarity. The occurrence of any one SP on the sequence of an enzyme specifies its EC functionality according to the specific branch N of its SPN. For example, enzyme P45048 (see
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B) contains SSAATYG, an SP3 specific to 5.1.3, and LNVYGYSK, an SP4 specific to 5.1.3.20. The relationship of these SPs to the EC hierarchy of SP families is shown in
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
A.</p>
<table-wrap id="pcbi-0030167-t001" content-type="2col" position="float">
<label>Table 1</label>
<caption>
<p>Specific Peptides in All Six Classes of Swiss-Prot Release 48.3</p>
</caption>
<graphic xlink:href="pcbi.0030167.t001"></graphic>
</table-wrap>
<p>
<xref ref-type="table" rid="pcbi-0030167-t001">Table 1</xref>
shows that the SPs cover (i.e., appear on the sequence of) most enzymes in the dataset. The coverage columns display the cumulative coverage of all SPs to their left. Coverage is a measure of the success of the SP approach. Thus, from the sixth column one can deduce that functional classification at the third level of EC is specified by 45,819 peptides of SP3 and SP4, covering 89.8% of the data.</p>
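<p>Coverage in this sense can be sketched as follows, treating an SP as covering an enzyme when it occurs as an exact substring of the sequence. The two peptides are taken from the text; the three sequences are invented toy strings, not real enzymes.</p>

```python
def coverage(enzymes, peptides):
    """Fraction of enzyme sequences containing at least one peptide as an exact substring."""
    covered = sum(1 for seq in enzymes if any(p in seq for p in peptides))
    return covered / len(enzymes)


toy_enzymes = ["AASSAATYGKK", "MMLNVYGYSKAA", "GGGGGG"]  # invented sequences
print(coverage(toy_enzymes, ["SSAATYG"]))
print(coverage(toy_enzymes, ["SSAATYG", "LNVYGYSK"]))
```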
<p>Information about the separate coverage of each SPN group is provided in
<xref ref-type="supplementary-material" rid="pcbi-0030167-st001">Table S1</xref>
. The length distribution of SPs is displayed in
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg001">Figure S1</xref>
for all enzyme classes. No SP exists with a length shorter than four amino acids. The average SP length is 8.4 (s.d. 4.5). The distribution of the number of SPs occurring on enzymes is given in
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg002">Figure S2</xref>
. It is very flat. On average, 15.6 SPs appear on each enzyme, with a standard deviation of 16. Enzyme sequences that share long SPs are highly similar, while sharing short SPs indicates lower sequence similarity. This is displayed for short (fewer than nine amino acids) and medium-length (between nine and 12 amino acids) SPs in
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg003">Figures S3</xref>
and
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg004">S4</xref>
: most enzyme pairs that share SPs longer than 12 amino acids possess sequence identity of over 90%.</p>
</sec>
<sec id="s2b">
<title>Prediction of Enzyme Classes</title>
<p>The Swiss-Prot 48.3 dataset contains 260 enzymes that have more than one annotation and have therefore been excluded from the training set (see
<xref ref-type="sec" rid="s4">Methods</xref>
). Using them as a test set, we find 849 hits of SPs on 157 of these enzymes. Of the 849 hits, 711 agree with one of the given annotations and 138 do not, giving an accuracy of 84%. The results are displayed in
<xref ref-type="supplementary-material" rid="pcbi-0030167-st002">Table S2</xref>
, comparing the Swiss-Prot EC annotations with SP predictions. For example, the first protein on the list has Swiss-Prot EC annotations of 2.7.2.4 and 1.1.1.3. Its sequence matches two SPs, one SP1 of class 1 and one SP4 of 2.7.2.4. This is counted as two correct matches. An analysis of
<xref ref-type="supplementary-material" rid="pcbi-0030167-st002">Table S2</xref>
shows that predictions based on a single SP hit may be erroneous, while those based on more than two SPs whose EC assignments are consistent with one another are correct.</p>
<p>We have tested the generalization quality of our SP-based enzyme classification by running MEX on the Swiss-Prot 45 release (October 2004) and testing its predictions on 10,000 novel enzymes that are listed in the Swiss-Prot 48.3 release (for the relation between these two sets see
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg005">Figure S5</xref>
and
<xref ref-type="supplementary-material" rid="pcbi-0030167-st003">Table S3</xref>
). Generalization quality is assessed in
<xref ref-type="table" rid="pcbi-0030167-t002">Table 2</xref>
by recall (matching SPs extracted from the 45 data on novel enzymes) and precision (correctness of the “45” EC assignment according to “48.3” annotations). Precision can be defined at the SP level, i.e., to what extent the EC of an SP matches the true EC of the enzyme that it hits. Precision can also be defined at the enzyme level: how many enzymes are correctly identified by all SPs that hit them. In other words, we demand that the EC assignments of all SPs be consistent with one another as well as with the “48.3” annotation of the enzyme. Overall recall is 84%. Precision at the SP level is almost perfect, 98.7%; nonetheless, at the enzyme level it drops to 81.7%. The reason is that usually many SPs hit each enzyme, and the small error at the SP level is magnified by the requirement that the EC labels of all SPs on the same enzyme be consistent with each other.</p>
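<p>The difference between the two precision measures can be made concrete with a small sketch. It assumes that each SP hit is recorded as the EC branch label of the SP and that a hit is correct when this label is a prefix of the annotated EC number; the enzyme identifiers E1 and E2 and the function names are hypothetical.</p>

```python
def matches(branch, ec):
    """True if an SP branch label such as '5.1.3' is a prefix of the full EC number."""
    b, e = branch.split("."), ec.split(".")
    return e[:len(b)] == b


def precision(hits, truth):
    """hits: enzyme id -> list of SP branch labels; truth: enzyme id -> annotated EC number.

    Returns (SP-level precision, enzyme-level precision) over enzymes with at least one hit.
    """
    flags = {enz: [matches(b, truth[enz]) for b in labels]
             for enz, labels in hits.items() if labels}
    sp_hits = [f for fl in flags.values() for f in fl]
    sp_precision = sum(sp_hits) / len(sp_hits)
    # An enzyme counts as correct only if every SP hitting it agrees with its annotation.
    enzyme_precision = sum(all(fl) for fl in flags.values()) / len(flags)
    return sp_precision, enzyme_precision


hits = {"E1": ["5.1.3", "5.1.3.20"], "E2": ["1", "1.1.1.3", "2.7.2.4"]}
truth = {"E1": "5.1.3.20", "E2": "1.1.1.3"}
print(precision(hits, truth))  # (0.8, 0.5)
```

<p>In this toy case a single wrong hit out of five lowers SP-level precision to 0.8 but enzyme-level precision to 0.5, which mirrors the magnification effect noted above.</p>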
<table-wrap id="pcbi-0030167-t002" content-type="1col" position="float">
<label>Table 2</label>
<caption>
<p>Performance of SPs Extracted from the Swiss-Prot 45 Dataset on Novel Enzyme Sequences in Swiss-Prot 48.3</p>
</caption>
<graphic xlink:href="pcbi.0030167.t002"></graphic>
</table-wrap>
<p>This generalization test suffers from bias, i.e., there exist enzymes in the test set that have high sequence similarity to some enzymes in the training set. In conventional machine-learning analysis of sequence-to-function classification [
<xref rid="pcbi-0030167-b002" ref-type="bibr">2</xref>
], one often tries to eliminate bias by avoiding high sequence similarity between proteins in the test set and proteins in the training set. In our case this is problematic, because it effectively calls for eliminating from the test set all enzymes that have four-digit EC numbers appearing in the training set. Alternatively, one could produce for each enzyme in the test set a new training set that does not contain sequences with the same EC number, which is both unconventional and computationally very complex.</p>
<p>To overcome this predicament, we have used the following procedure: a) start with the test set consisting of all sequences of Swiss-Prot release 48.3 that do not appear in release 45; b) BLAST each one of these (test set) sequences against the sequences of the training set (Swiss-Prot release 45) that do not have the same four-digit EC number; c) include in the non-redundant test set only sequences whose BLAST score [
<xref rid="pcbi-0030167-b020" ref-type="bibr">20</xref>
] with all other training sequences (including those with the same first three EC digits) is larger than 10
<sup>−3</sup>
; d) test generalization on this non-redundant set only for peptides in SP1, SP2, and SP3, thus avoiding the SP4 peptides that were extracted from the same fourth-level EC sequences as those of the non-redundant test set. It should be noted that removing the SP4 peptides makes the functional annotation task much more difficult because the coverage of enzymes by SPs is strongly reduced. Only 440 enzymes obey the BLAST > 10
<sup>−3</sup>
condition, and fewer than 40% of them carry SP1, SP2, and SP3 matches.</p>
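<p>Steps c) and d) of this procedure can be sketched as two simple filters. The sketch assumes that the BLAST comparison of step b) has already been run and that, for each test sequence, the best (smallest) value against the training set is stored in a dictionary; it further assumes that this value behaves like an E-value, so that larger means less similar. Both function names are hypothetical.</p>

```python
def non_redundant(test_ids, best_value, threshold=1e-3):
    """Keep test sequences whose best BLAST value against the training set exceeds the threshold."""
    return [t for t in test_ids if best_value[t] > threshold]


def drop_sp4(labels):
    """Keep only SP1, SP2, and SP3 branch labels (fewer than four EC digits)."""
    return [b for b in labels if b.count(".") != 3]


print(non_redundant(["a", "b"], {"a": 0.5, "b": 1e-20}))
print(drop_sp4(["5.1.3", "5.1.3.20", "1"]))
```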
<p>The results are displayed in
<xref ref-type="table" rid="pcbi-0030167-t003">Table 3</xref>
. We obtain correct classification with an accuracy of 88%. The test is that of precision of SP assignments, i.e., to what extent do the EC labels of the SPs, observed to exist on the enzyme sequences, correspond to “48.3” EC classifications.</p>
<table-wrap id="pcbi-0030167-t003" content-type="2col" position="float">
<label>Table 3</label>
<caption>
<p>Coverage of a Non-Redundant Test Set by Motifs in SP1, SP2, and SP3</p>
</caption>
<graphic xlink:href="pcbi.0030167.t003"></graphic>
</table-wrap>
<p>Whereas even the unbiased tests have high precision, we should emphasize that many successes of the SP approach are due to SP4 peptides, whose existence stems from high homology among different sequences that belong to the same EC number. These successes include the high coverage of enzymes (see
<xref ref-type="table" rid="pcbi-0030167-t001">Table 1</xref>
) and the coverage of active and binding sites to be discussed below. The fact that these SPs have been extracted by MEX may be viewed as the essence of homology, as illustrated in
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B, where the existence of SPs is displayed on various enzymes aligned according to their matching SPs.</p>
<p>We provide a Web tool, available at
<ext-link ext-link-type="uri" xlink:href="http://adios.tau.ac.il/SPMatch">http://adios.tau.ac.il/SPMatch</ext-link>
, which displays the hits of SPs for any protein sequence that is presented as a query, together with the EC assignments due to these SPs.</p>
</sec>
<sec id="s2c">
<title>Comparison with Other Methods</title>
<p>We have tested the usefulness of the SP approach by comparing it with conventional functional prediction methods. For this purpose we have used all oxidoreductases in the 48.3 data and divided them into training data and test data with a 75%:25% ratio. MEX was run on all data and SPs were selected from the MEX motifs according to the training data. Only this subset of motifs was then employed to classify the test data. This procedure has been repeated 45 times to gain statistics, and has been subjected to a support vector machine (SVM) analysis. It has been compared with a state-of-the-art method [
<xref rid="pcbi-0030167-b017" ref-type="bibr">17</xref>
] based on an analogous SVM procedure, applied to the same data using the same divisions and relying on classification of (train and test) data according to a matrix of Smith-Waterman distances from all oxidoreductases. The results are displayed in
<xref ref-type="supplementary-material" rid="pcbi-0030167-st004">Tables S4</xref>
and
<xref ref-type="supplementary-material" rid="pcbi-0030167-st005">S5</xref>
and show a clear advantage for SP classification. For comparison, we use the Jaccard score defined as J = TP / (TP + FP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. Whereas sequence similarity leads to an average Jaccard score of 0.86 on the second EC level and 0.82 on the third level, SP classification has average Jaccard scores of 0.93 and 0.92, respectively. Comparing with yet another method, SVM-Prot [
<xref rid="pcbi-0030167-b018" ref-type="bibr">18</xref>
,
<xref rid="pcbi-0030167-b019" ref-type="bibr">19</xref>
], which classifies enzymes on the basis of physical and chemical features of their amino acids, we note that the latter achieves a Jaccard score of only 0.74 on all oxidoreductases data at the second EC level.</p>
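<p>For reference, the Jaccard score used in this comparison can be computed as in the following sketch, which represents predictions and annotations as sets of (enzyme, EC label) pairs. The representation and the toy labels are assumptions made for illustration only.</p>

```python
def jaccard(predicted, actual):
    """J = TP / (TP + FP + FN) for two sets of (enzyme, label) pairs."""
    tp = len(predicted.intersection(actual))
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    return tp / (tp + fp + fn)


predicted = {("E1", "1.1"), ("E2", "1.2"), ("E3", "1.3")}
actual = {("E1", "1.1"), ("E2", "1.2"), ("E3", "1.4")}
print(jaccard(predicted, actual))  # 0.5
```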
<p>The common lore, that large sequence identity between two proteins implies that the two have the same function, has its exceptions. Motifs, although often extracted from homology, may serve as better measures for functional specification of proteins [
<xref rid="pcbi-0030167-b021" ref-type="bibr">21</xref>
] than overall sequence similarity.
<xref ref-type="table" rid="pcbi-0030167-t004">Table 4</xref>
demonstrates this point by contrasting SP predictions with Smith-Waterman similarity results for pairs of enzymes. These extreme cases have been posed as a problem by Rost [
<xref rid="pcbi-0030167-b005" ref-type="bibr">5</xref>
] (see
<xref ref-type="table" rid="pcbi-0030167-t001">Table 1</xref>
there). All displayed EC assignments correspond to those of SPs located on the enzyme sequences, and match the correct EC numbers. As a more detailed example, we point out that the enzymes of the sixth pair in
<xref ref-type="table" rid="pcbi-0030167-t004">Table 4</xref>
, GTFB_STRMU and AMY3B_ORYSA, have 42% sequence identity along an alignment of 105 amino acids. Nonetheless, the sequences are not identical at the SP locations. AMY3B_ORYSA contains 24 SPs, none of which has an exact match on GTFB_STRMU, and a single SP4 (GGAFLE) found on GTFB_STRMU correctly matches its EC number.</p>
<table-wrap id="pcbi-0030167-t004" content-type="2col" position="float">
<label>Table 4</label>
<caption>
<p>Enzymes with High Sequence Similarity and Different EC Assignments</p>
</caption>
<graphic xlink:href="pcbi.0030167.t004"></graphic>
</table-wrap>
<p>It is of interest to compare our SPs with ProSite motifs [
<xref rid="pcbi-0030167-b008" ref-type="bibr">8</xref>
], which are listed in the Swiss-Prot database as standard motif annotations on 63% of the enzymes. ProSite motifs are either regular expressions (of average length 18.3 amino acids) or weight matrices, while SPs are deterministic motifs (with average length of 8.4). We search for all appearances of ProSite regular expression motifs on enzymes. Each such appearance is noted on the enzyme sequence, and we check whether it is also (partially) covered by an SP.
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg006">Figure S6</xref>
compares the appearance of SPs and ProSite motifs on the data, and
<xref ref-type="supplementary-material" rid="pcbi-0030167-sg007">Figure S7</xref>
displays the relative coverage of ProSite motifs by SPs as a function of the minimal percentage of amino acids belonging to the ProSite motif that are also located on SPs. Thus we find that if at least 40% of the amino acids of the ProSite motif also belong to SPs, which would be appropriate for an average SP located within an average ProSite motif, then SPs cover 48% of all ProSite motif occurrences. This may be compared with a random model (see
<xref ref-type="sec" rid="s4">Methods</xref>
) which covers on average only 24% of ProSite motif occurrences, with a standard deviation of 0.06%. This extremely significant result (400 s.d.) demonstrates that SPs carry information that is highly correlated with that of ProSite motifs.</p>
</sec>
<sec id="s2d">
<title>Biological Roles of Specific Peptides</title>
<sec id="s2d1">
<title>Coverage of active sites.</title>
<p>Next we turn to establishing some particular biological roles for SPs. First we investigate their coverage of active and binding sites. Of all enzymes in the Swiss-Prot 48.3 database, 42% have annotations of the loci of active sites and binding sites (single amino acids). For simplicity we will refer to both annotations as active sites. A few examples are shown in
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B. Given these loci, we find that 65% of all active sites are covered by SPs. This can be compared with the coverage of random positions on the same enzyme sequences which, on average, is only 27% (off by 80 standard deviations, see
<xref ref-type="sec" rid="s4">Methods</xref>
). We also construct a non-redundant set by choosing only one enzyme for each EC number (i.e., EC class of level 4). The results, displayed in
<xref ref-type="table" rid="pcbi-0030167-t005">Table 5</xref>
, show some differences between the total and the non-redundant sets. Since the latter is unbiased, it should generalize better, and allow us to get a better estimate of active-site coverage had the annotations existed for all enzymes. This estimate is 12% and has very high statistical significance (zero
<italic>p</italic>
-value, see
<xref ref-type="sec" rid="s4">Methods</xref>
).</p>
<table-wrap id="pcbi-0030167-t005" content-type="1col" position="float">
<label>Table 5</label>
<caption>
<p>Occurrence of Specific Peptides on Active Sites</p>
</caption>
<graphic xlink:href="pcbi.0030167.t005"></graphic>
</table-wrap>
<p>As an example of these features in the data, we display in
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B aligned subsequences of enzymes, belonging to the same third level but to two different fourth levels of the EC hierarchy: six out of 35 enzymes of 5.1.3.2 and seven out of 29 enzymes of 5.1.3.20. Shown are strings belonging to the sequences that include active sites and binding sites as indicated in Swiss-Prot annotations, and boldfaced substrings denoting SPs from our lists. Whereas in 5.1.3.20, most active sites are covered by SPs, this is not the case for the active site of 5.1.3.2. Nonetheless, it turns out from investigating spatial structures of these enzymes that RYFNV, an SP that appears in both groups, is located within the same pocket in which the active site resides. This may be regarded as an indication that RYFNV plays an important role in fostering the biological function of this enzyme.</p>
<p>An example stressing the relationships among SPs and spatial structures is presented in
<xref ref-type="fig" rid="pcbi-0030167-g002">Figure 2</xref>
. This enzyme contains many SPs. Two SPs cover the active site; one (HMVRNI) shares a pocket with the active site and the two binding sites, and another (FHARF) plays the role of RNA binding in this tRNA pseudouridine synthase I.</p>
<fig id="pcbi-0030167-g002" position="float">
<label>Figure 2</label>
<caption>
<title>SPs Occurrence on a Spatial Structure of an Enzyme</title>
<p>(A) 3-D display of enzyme P07649 (PDB code 1DJ0), belonging to 5.4.99.12, showing (1) an active site D at sequence location 60; (2) a binding site Y at location 118; (3) a binding site L at location 245. The active site is common to two SPs (4) containing (CAGRT(D)AGVH). Other shown SPs are (5) GQVVH at locations 67–71; (6) FHARF at 107–111, known to be a tentative RNA-binding peptide; (7) ENDFTS at 157–163; and (8) HMVRNI at 201–207, sharing a pocket with the active and binding sites. GQVVH and ENDFTS belong to SP3; all other peptides belong to SP4.</p>
<p>(B) A different display of the same enzyme focuses on the pocket containing the active site. The relevant section of the sequence is shown, with red residues signifying active and binding sites, green residues corresponding to other amino acids residing in the pocket, and underlined residues corresponding to SPs.</p>
</caption>
<graphic xlink:href="pcbi.0030167.g002"></graphic>
</fig>
<p>FHARF is one example of previously discovered motifs [
<xref rid="pcbi-0030167-b022" ref-type="bibr">22</xref>
]. Some other examples are: a) GFGRIG (SP of 1.2.1.12) [
<xref rid="pcbi-0030167-b023" ref-type="bibr">23</xref>
], a conserved region of GAPDH that is active in the glycolytic pathway; b) HRDLKP (SP of 2.7.1.37) [
<xref rid="pcbi-0030167-b024" ref-type="bibr">24</xref>
], appearing in protein kinases; c) IFIDEID (SP of 3.6.4.3), the Walker B motif of ATPase [
<xref rid="pcbi-0030167-b025" ref-type="bibr">25</xref>
]; to name a few. However, most of the SPs have not been studied before.</p>
<p>These results raise the question of how many SPs can be found in the neighborhood of active sites, as defined by the pockets in the spatial structures of enzymes. One is naturally tempted to assign importance to all SPs of this kind, not just those that carry the active site annotation (single amino acid). For this study we use the CASTp [
<xref rid="pcbi-0030167-b026" ref-type="bibr">26</xref>
] database, which lists all amino acids belonging to pockets appearing in spatial structures of proteins. We select 1,031 enzymes that possess pockets including active (or binding) site annotations. There are 8,860 SPs that occur on these enzymes, 31% of which lie within these “active pockets,” i.e., have at least four amino acids that reside in the pocket. Defining a background model (see
<xref ref-type="sec" rid="s4">Methods</xref>
) of random peptides selected for each event of an SP hitting an active pocket in a particular enzyme, we estimate that 11% of all SPs belong to events that pass an FDR limit [
<xref rid="pcbi-0030167-b027" ref-type="bibr">27</xref>
] of 0.05. Most of them (70%) do not contain an active site; hence, they are of potential interest for experimental verification of their importance in defining and maintaining the enzymatic function.
<xref ref-type="table" rid="pcbi-0030167-t006">Table 6</xref>
summarizes the results of this analysis. Further details of all significant events are presented in
<xref ref-type="supplementary-material" rid="pcbi-0030167-st006">Table S6</xref>
. All 1,910 listed SP occurrences on enzymes should be of high relevance to the biological functions of these enzymes, and the elimination of any one of them from the enzyme sequence on which it occurs should be deleterious to the function of that enzyme or to its stability.</p>
<table-wrap id="pcbi-0030167-t006" content-type="1col" position="float">
<label>Table 6</label>
<caption>
<p>Occurrence of SPs in Spatial Proximity to Active Sites</p>
</caption>
<graphic xlink:href="pcbi.0030167.t006"></graphic>
</table-wrap>
<p>SPs may also have biological roles that are not connected to active or binding sites. Examples are DNA and RNA binding, metal binding, protein–protein interactions, etc. Given the large number of SPs, we may look forward to a plethora of predictions.</p>
</sec>
</sec>
<sec id="s2e">
<title>Minimal SP Sets with Maximal Coverage</title>
<p>We started our study with 50,698 enzymes from which 52,365 SPs were extracted. These SPs provided coverage of about 93% of all enzymes. By introducing further screening of SPs according to biological findings, a much reduced number of SPs may suffice for the purpose of classification. In the 48.3 data, 21,228 enzymes carry active or binding site annotations. The number of SPs hitting these enzymes is 26,931; however, only 2,337 cover the active or binding sites. These 2,337 are found to occur on 79% of the 21,228 enzymes. Thus, instead of the approximately 1:1 ratio between the number of SPs and the number of enzymes they cover as found previously, we now obtain a ratio that is almost an order of magnitude more parsimonious, about 1:8, while maintaining a similar level of classification accuracy.</p>
<p>The same SPs cover 36% of all original enzymes of our dataset. Performing a similar analysis on the 45 data, one finds that the 2,014 SPs that cover the annotated enzymes in it hit 75% of the relevant set of enzymes. Moreover, using the same SPs to classify the 10,585 novel enzymes contained in the 48.3 release and absent from the 45 release, one obtains coverage of 28% of them. This last fact demonstrates that the relatively large coverage reached by the small fraction of SPs that hit active sites is not limited to the dataset (training set) used to define the SPs. All these results are summarized in
<xref ref-type="table" rid="pcbi-0030167-t007">Table 7</xref>
. It therefore seems quite reasonable to conclude that, by adding information on biological markers, one can reduce the ratio between the number of SPs needed to label the EC classification of a set of enzymes and the number of those enzymes from 1:1 to about 1:8.</p>
<table-wrap id="pcbi-0030167-t007" content-type="1col" position="float">
<label>Table 7</label>
<caption>
<p>Small Sets of SPs that Contain Active Sites Suffice To Specify Functionality of Many Enzymes</p>
</caption>
<graphic xlink:href="pcbi.0030167.t007"></graphic>
</table-wrap>
<p>This, however, does not mean that all other SPs should be disregarded. First, there exist good chances that they are of biological importance for various structural and functional reasons that may warrant further investigation. Second, when extreme classification issues come up, as in the cases displayed in
<xref ref-type="table" rid="pcbi-0030167-t004">Table 4</xref>
, every single SP may count.</p>
</sec>
</sec>
<sec id="s3">
<title>Discussion</title>
<p>Conventional wisdom attributes protein functions to large domains, as well as to specific amino acids at strategic structural points on the protein. Large-scale studies often make use of multiple sequence alignment (MSA), phylogenetic information, and sophisticated mathematical models, thus leading to the plethora of algorithms and Web tools that permeate bioinformatics. While all that may be necessary to obtain a thorough understanding of the way proteins develop and perform, much can be gained by shifting attention to deterministic linear motifs on proteins. In doing so, we return to a way that has been often tried in the past. Thus, in the 1990s, many investigations looked for k-mers that are over-represented in sequences of proteins that have common functional properties. Some examples are ProSite [
<xref rid="pcbi-0030167-b008" ref-type="bibr">8</xref>
,
<xref rid="pcbi-0030167-b012" ref-type="bibr">12</xref>
], with which we have compared our results, and papers such as [
<xref rid="pcbi-0030167-b028" ref-type="bibr">28</xref>
–
<xref rid="pcbi-0030167-b030" ref-type="bibr">30</xref>
], where major emphasis has been put on finding a complete dictionary of motifs that cover all strings of amino acids that are of any importance. In the case of [
<xref rid="pcbi-0030167-b030" ref-type="bibr">30</xref>
], the search has been an unsupervised one leading eventually to a coverage of 98% of all amino acids on the protein strings. Some reviews of the motif approaches of the 1990s are [
<xref rid="pcbi-0030167-b007" ref-type="bibr">7</xref>
,
<xref rid="pcbi-0030167-b009" ref-type="bibr">9</xref>
]. More recently, interests have shifted to automated prediction tools that may make use of motifs but are not limited to them. Examples are the GOtcha method [
<xref rid="pcbi-0030167-b031" ref-type="bibr">31</xref>
] that uses sequence-identity searches of various genomes to predict functional annotation, and [
<xref rid="pcbi-0030167-b032" ref-type="bibr">32</xref>
] who pursue the same goal using PSI-BLAST searches with varying resolution.</p>
<p>Our goal is more modest, restricting ourselves to the functional classification of enzymes. By doing so, and by applying the MEX algorithm together with limiting ourselves to SPs within the EC hierarchy, we are able to classify all enzymes by SPs occurring on them with coverage between 87% and 93%, depending on the EC level that is being looked for (
<xref ref-type="table" rid="pcbi-0030167-t001">Table 1</xref>
). Classification success of novel sequences that belong to the same type of data has coverage of 84% and precision of 99% at the SP level and 82% at the enzyme level (
<xref ref-type="table" rid="pcbi-0030167-t002">Table 2</xref>
). Restricting ourselves to low bias (
<xref ref-type="table" rid="pcbi-0030167-t003">Table 3</xref>
), we still have a large precision of 88% at the SP level. We have demonstrated that our results surpass the classification accuracy of sequence similarity (using Smith-Waterman [
<xref rid="pcbi-0030167-b033" ref-type="bibr">33</xref>
]), and our SPs have a higher coverage than ProSite motifs. As such, they become a powerful tool that may be added to existing automated searches.</p>
<p>It should be noted that the SPs were extracted by an unsupervised motif search algorithm, applied to each one of the six EC classes. This is quite different from conventional supervised approaches. Our method may disregard motifs that obey some over-representation criterion, and choose others that do not satisfy such a global statistical measure. Another major difference from other approaches is that we do not make use of MSA. MEX finds significant motifs without requiring alignment as a preprocessing stage. In fact, MEX can serve as a source for MSA by employing its motifs for alignment (see
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B).</p>
<p>SPs were selected from all MEX motifs by imposing the condition that they should be specific to particular levels of the EC hierarchy. This has led to a large number of SPs, as numerous as the set of all enzymes (but, obviously, providing a much more concise description). Imposing further biological conditions, one may find much smaller sets that suffice for classification. In an analysis of enzymes for which the active sites are known, we have shown that the set of SPs bearing these active sites, which comprises just 8.6% of all relevant SPs (i.e., those occurring anywhere on these enzymes), suffices to cover (and therefore label) 79% of these enzymes.</p>
<p>Conventional classification methods rely on homology. While large homology is also at the root of our success for most SPs of level 4 (see some examples in
<xref ref-type="fig" rid="pcbi-0030167-g001">Figure 1</xref>
B), we have demonstrated (in
<xref ref-type="table" rid="pcbi-0030167-t004">Table 4</xref>
) that SPs can also be of importance in extreme cases, where straightforward comparison of an enzyme to another one with large sequence similarity may be misleading.</p>
<p>In conclusion, we have established a comprehensive and accurate classification scheme for enzymes based on the occurrence of short peptides on their sequences. The SPs contain, on average, just 8.4 amino acids, yet they suffice to correctly classify an overwhelming majority of known enzymes. Moreover, we have found indications for some of the biological roles of SPs, e.g., covering a majority of active sites. This study has laid the foundations for the further experimental investigation of these intriguing sets of SPs.</p>
</sec>
<sec sec-type="methods" id="s4">
<title>Methods</title>
<sec id="s4a">
<title>Motif extraction.</title>
<p>MEX is a motif extraction algorithm that serves as the basic unit of ADIOS [
<xref rid="pcbi-0030167-b015" ref-type="bibr">15</xref>
], an unsupervised method for extraction of syntax from linguistic corpora. We apply it to the problem of finding sequence motifs in enzymes.</p>
<p>Each enzyme sequence is represented as a path over a graph containing 20 vertices, each vertex representing one amino acid. After uploading all enzyme sequences onto the graph, one counts the number of paths connecting vertices in order to define probabilities such as</p>
<p>p(e
<sub>j</sub>
|e
<sub>i</sub>
) = (number of paths proceeding from e
<sub>i</sub>
to e
<sub>j</sub>
) / (total number of paths leaving e
<sub>i</sub>
)</p>
<p>p(e
<sub>k</sub>
|e
<sub>j</sub>
,e
<sub>i</sub>
) = ( number of paths proceeding from e
<sub>i</sub>
to e
<sub>j</sub>
to e
<sub>k</sub>
) / (number of paths proceeding from e
<sub>i</sub>
to e
<sub>j</sub>
)</p>
<p>for all vertices e
<sub>i</sub>
of the graph. These data-driven probabilities allow for the definition of a position-dependent variable-order Markov model describing the data.</p>
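<p>The two counting formulas above can be sketched as follows. This is only an illustration of the path counts on the 20-vertex graph, not the MEX implementation; the function name and the dictionary layout are assumptions.</p>

```python
from collections import Counter


def transition_probabilities(sequences):
    """Estimate p(b|a) and p(c|a,b) by counting paths through residue pairs and triples.

    Returns two dicts: p1["AC"] is p(C|A) and p2["ACD"] is p(D|A,C).
    """
    leaving, pairs, triples = Counter(), Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            leaving[seq[i]] += 1           # one more path leaving vertex seq[i]
            pairs[seq[i:i + 2]] += 1       # one more path from seq[i] to seq[i+1]
        for i in range(len(seq) - 2):
            triples[seq[i:i + 3]] += 1     # one more path through three vertices
    p1 = {ab: n / leaving[ab[0]] for ab, n in pairs.items()}
    p2 = {abc: n / pairs[abc[:2]] for abc, n in triples.items()}
    return p1, p2
```

<p>Longer contexts follow the same pattern, which is what makes a variable-order model possible.</p>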
<p>A motif that is extracted by MEX is a subpath along the graph defined by probability-based criteria that account for convergence of many paths into the beginning point of a motif, and divergence of many paths from the endpoint of the motif. Motifs are not constrained by length, and may overlap with one another (see, e.g., the two SPs that overlap at the active site D in
<xref ref-type="fig" rid="pcbi-0030167-g002">Figure 2</xref>
B). The only two parameters of MEX are η, specifying a decrease in probability measures that determine convergence and divergence, and α specifying their statistical significance. For more details, see [
<xref rid="pcbi-0030167-b015" ref-type="bibr">15</xref>
] and
<ext-link ext-link-type="uri" xlink:href="http://adios.tau.ac.il">http://adios.tau.ac.il</ext-link>
. Throughout this paper, we use η = 0.9 and α = 0.01.</p>
</sec>
<sec id="s4b">
<title>Data.</title>
<p>Protein sequences annotated with EC numbers were extracted from the Swiss-Prot database (Release 48.3, 25 October 2005). To obtain a high-quality, well-defined training dataset, the data were strictly screened and the following sequences were removed: sequences shorter than 100 amino acids or longer than 1,200 amino acids, sequences with uncertain annotation, and enzymes that catalyze more than one reaction (i.e., have more than one EC number).</p>
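<p>The screening criteria can be written as a simple filter. The sketch below assumes that each entry has already been parsed into a sequence, a list of EC numbers, and a flag marking uncertain annotation; how that flag is derived from a Swiss-Prot record is not specified here, so it is left as a hypothetical input.</p>

```python
def keep_entry(sequence, ec_numbers, uncertain=False, min_len=100, max_len=1200):
    """Return True if an entry passes the screening described in the text.

    `uncertain` is a hypothetical flag for entries with uncertain annotation.
    """
    if uncertain:
        return False
    if len(ec_numbers) != 1:          # drop enzymes with more than one EC number
        return False
    # Keep lengths from min_len to max_len inclusive.
    return len(sequence) in range(min_len, max_len + 1)
```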
</sec>
<sec id="s4c">
<title>Random model for SP hits on ProSite motifs.</title>
<p>Enzyme sequences are searched for matches with the regular expressions of ProSite motifs. The resulting strings of amino acids are checked for matches with SPs. These matches are compared with those of a random model in which, for each given enzyme, random peptides are selected with the same lengths as those of the SPs that hit this enzyme. The random model provides a probability distribution that serves as a null model for calculating the significance of the SP hit on the ProSite motif. This comparison is made for each enzyme and for varying fractions of amino acids that are shared by the SP with the ProSite motif.</p>
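<p>A minimal Monte Carlo sketch of such a random model is given below. It assumes that the random peptides are placed uniformly along the enzyme and that coverage is measured as the fraction of ProSite-motif positions overlapped by any peptide; the function names, the number of trials and the half-open span convention are illustrative choices, not details taken from the text.</p>

```python
import random


def covered_fraction(motif_span, peptide_spans):
    """Fraction of positions in motif_span (start, end exclusive) covered by any peptide."""
    start, end = motif_span
    covered = set()
    for s, e in peptide_spans:
        covered.update(range(max(s, start), min(e, end)))
    return len(covered) / (end - start)


def random_spans(seq_len, lengths, rng):
    """Place one random peptide per requested length, uniformly over valid starts."""
    spans = []
    for n in lengths:
        s = rng.randrange(seq_len - n + 1)
        spans.append((s, s + n))
    return spans


def null_probability(seq_len, motif_span, sp_lengths, min_fraction, trials=10000, seed=0):
    """Estimate how often random peptides cover at least min_fraction of the motif."""
    rng = random.Random(seed)
    hits = sum(
        covered_fraction(motif_span, random_spans(seq_len, sp_lengths, rng)) >= min_fraction
        for _ in range(trials)
    )
    return hits / trials
```

<p>Repeating the estimate for several values of the required fraction gives a random curve against which the observed SP coverage can be compared.</p>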
</sec>
<sec id="s4d">
<title>Significance of SP hits on active sites.</title>
<p>In analyzing the significance of SP coverage of active (and binding) sites, we compare this coverage with that of randomly chosen residues on enzyme sequences. This is carried out on all data (i.e., annotated enzymes with SP hits) and on a non-redundant set composed of only one enzyme from each EC number (i.e., EC classification at level 4). The deviations of the measurements from random distributions are very high, and are quoted in numbers of standard deviations. The corresponding
<italic>p</italic>
-values are zero to within Matlab's numerical accuracy, i.e., are well below 10
<sup>−308</sup>
.</p>
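<p>The sketch below shows how a deviation quoted in standard deviations can be obtained, and why the corresponding tail probability evaluates to zero in double precision. It assumes a binomial null in which each annotated site is covered with a fixed background probability, and a normal approximation for the tail; both are illustrative assumptions rather than the procedure used here.</p>

```python
import math


def coverage_z_score(n_sites, n_covered, background_fraction):
    """Deviation of the observed coverage from a binomial null, in standard deviations."""
    mean = n_sites * background_fraction
    sd = math.sqrt(n_sites * background_fraction * (1.0 - background_fraction))
    return (n_covered - mean) / sd


def upper_tail_p(z):
    """One-sided normal tail probability; underflows to 0.0 for very large z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

<p>For deviations of several tens of standard deviations the true tail probability is smaller than the smallest normalized double (about 2.2 × 10<sup>−308</sup>), so the computed value is exactly zero.</p>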
</sec>
<sec id="s4e">
<title>Significance of SP residing in active pockets.</title>
<p>Let us define an event as the occurrence of a given SP within an active pocket in a given enzyme. For each such event, we evaluate the probability that at least one of a set of randomly selected subsequences of this enzyme, matching in length the various SPs that occur on this enzyme, lies (with at least four amino acids) within the active pocket. This defines the
<italic>p</italic>
-value that we assign to the event. We then select the significant events according to an FDR limit [
<xref rid="pcbi-0030167-b033" ref-type="bibr">33</xref>
] of 0.05.</p>
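<p>The two steps can be sketched as follows. The event probability is estimated here by Monte Carlo sampling with uniformly placed subsequences, and the FDR limit is assumed to refer to the Benjamini-Hochberg step-up procedure that appears in the reference list; the function names, the number of trials and the use of sampling are assumptions made for illustration.</p>

```python
import random


def pocket_event_p_value(seq_len, pocket_positions, sp_lengths,
                         min_overlap=4, trials=10000, seed=0):
    """Estimate P(at least one random subsequence shares min_overlap residues with the pocket)."""
    rng = random.Random(seed)
    pocket = set(pocket_positions)
    hits = 0
    for _ in range(trials):
        for n in sp_lengths:
            s = rng.randrange(seq_len - n + 1)
            if len(pocket.intersection(range(s, s + n))) >= min_overlap:
                hits += 1
                break                      # one success per trial is enough
    return hits / trials


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of events declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is at most q * rank / m.
        if q * rank / m >= p_values[i]:
            cutoff = rank
    return sorted(order[:cutoff])
```

<p>Events returned by the second function would correspond to those listed as significant.</p>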
</sec>
</sec>
<sec sec-type="supplementary-material" id="s5">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg001">
<label>Figure S1</label>
<caption>
<title>SP Length Distribution</title>
<p>(377 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg001.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg002">
<label>Figure S2</label>
<caption>
<title>Distribution of the Numbers of SPs Occurring on Enzymes</title>
<p>(500 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg002.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg003">
<label>Figure S3</label>
<caption>
<title>Distribution of Percentages of Sequence Identity for Pairs of Enzymes Sharing the Same SP3 or SP4 of Length Less Than Nine Amino Acids</title>
<p>(172 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg003.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg004">
<label>Figure S4</label>
<caption>
<title>Distribution of Percentages of Sequence Identity for Sets of Enzymes That Share the Same SP3 or SP4 of Length between 9 and 12</title>
<p>(156 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg004.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg005">
<label>Figure S5</label>
<caption>
<title>Relation of Enzymes in Two Swiss-Prot Releases, 45 (October 2004) and 48.3 (October 2005)</title>
<p>(118 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg005.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg006">
<label>Figure S6</label>
<caption>
<title>Data Coverage by ProSite Regular Expression Motifs and by SPs in the Swiss-Prot Database</title>
<p>(183 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg006.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-sg007">
<label>Figure S7</label>
<caption>
<title>Coverage of ProSite Motifs by SPs versus the Required Minimal Amount of Amino Acids Shared by the Two Motifs</title>
<p>For each occurrence of a ProSite motif (of average length 18 amino acids) on an annotated enzyme, we searched for SP matches. The cumulative percentage of ProSite motifs that are covered by SPs is plotted as a function of the relative amount of coverage, i.e., the percentage of the amino acids belonging to the ProSite motif that are shared with the SP. This is compared with the coverage of ProSite motifs by random motifs that have the same lengths and number as the SPs appearing on the enzymes.</p>
<p>(360 KB JPG)</p>
</caption>
<media xlink:href="pcbi.0030167.sg007.jpg">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-st001">
<label>Table S1</label>
<caption>
<title>Coverage by SPs of Enzymes in Swiss-Prot Release 48.3</title>
<p>(30 KB DOC)</p>
</caption>
<media xlink:href="pcbi.0030167.st001.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-st002">
<label>Table S2</label>
<caption>
<title>Comparison between Swiss-Prot Annotations and SP Predictions for Doubly Annotated Enzymes</title>
<p>Columns indicate the protein ID according to Swiss-Prot, its two EC assignments, the EC assignments according to SP predictions, and the number of SP matches that have the same EC prediction (separated into correct and false predictions). An analysis of the data shows that predictions based on a single SP match in the enzyme sequence are often wrong (122 false predictions versus 80 true predictions). The appearance of two SPs whose EC assignments are consistent with each other leads to 19 true predictions and five false predictions. All predictions based on more than two consistent SPs are true. When counting enzymes (rather than SPs), we find that 92 of 157 have one false prediction and no true prediction. Forty-eight enzymes have one false prediction; 31 of them also have one true prediction, and 17 have two true predictions. Sixty-five enzymes have no false prediction; 43 of them have one true prediction and 22 have two true predictions. It should be noted that this list contains many related enzymes (i.e., it has high bias), hence successes and failures in different enzymes are correlated. It seems safe, however, to conclude that predictions based on several SPs whose EC assignments are consistent with each other may be trusted.</p>
<p>(808 KB DOC)</p>
</caption>
<media xlink:href="pcbi.0030167.st002.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-st003">
<label>Table S3</label>
<caption>
<title>Numbers of Enzymes in Swiss-Prot Release 48.3 and Swiss-Prot Release 45</title>
<p>(34 KB DOC)</p>
</caption>
<media xlink:href="pcbi.0030167.st003.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-st004">
<label>Table S4</label>
<caption>
<title>Comparison of SP with Smith-Waterman Performance on Classification at the Subclass Level</title>
<p>Classification based on SPs has been compared with classification based on sequence similarity using the Smith-Waterman (SW) method. This has been performed on the oxidoreductases data of the 48.3 release, using all subclasses and sub-subclasses that contain more than 20 enzyme sequences. The data were randomly partitioned into 75% training and 25% test sets. Features for the SP classification were determined by running MEX on all oxidoreductases and checking for their specificity using the training data only. These SPs were then used for defining, through training, the SP-SVM. Smith-Waterman analysis was carried out by defining a log(
<italic>p</italic>
-value) (with cutoff at p = e−06) distance matrix whose columns (features) were all oxidoreductases. The rows (instances) of the training-set enzymes were used to determine the SW-SVM classifications. Forty-five different partitions were performed to accumulate statistics. The same partitions were applied to both classification methods. Classification was performed using a soft-margin linear SVM, available online at
<ext-link ext-link-type="uri" xlink:href="http://svmlight.joachims.org">http://svmlight.joachims.org</ext-link>
.</p>
<p>Performance was measured by the Jaccard score J = TP / (TP + FP + FN).</p>
<p>(58 KB DOC)</p>
</caption>
<media xlink:href="pcbi.0030167.st004.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-st005">
<label>Table S5</label>
<caption>
<title>Comparison of SP with Smith-Waterman Performance on Classification at the Sub-Subclass Level</title>
<p>(99 KB DOC)</p>
</caption>
<media xlink:href="pcbi.0030167.st005.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="pcbi-0030167-st006">
<label>Table S6</label>
<caption>
<title>List of SPs That Lie in Active Pockets</title>
<p>A list of all events of SPs lying in active pockets that have passed the FDR = 0.05 limit, ordered according to their
<italic>p</italic>
-values. Entries include the enzyme PDB ID and the details of the SP.</p>
<p>(2.1 MB DOC)</p>
</caption>
<media xlink:href="pcbi.0030167.st006.doc">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>We thank Nir Ben-Tal, Assaf Gottlieb, Rachel Kolodny, Martin Kupiec, Ruth Nussinov, Yanay Ofran, Tal Pupko, Burkhard Rost, Roded Sharan, and Roy Varshavsky for comments and helpful discussions. We thank Joe Dundas and the CASTp team for making some of their data available to us. BS is supported by the Yeshaya Horowitz Association through the Center of Complexity Science.</p>
</ack>
<fn-group>
<fn id="ack3" fn-type="COI-statement">
<p>
<bold>Competing interests.</bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn id="n102" fn-type="previously-at">
<p>A previous version of this article appeared as an Early Online Release on July 11, 2007 (doi:
<ext-link ext-link-type="doi" xlink:href="10.1371/journal.pcbi.0030167.eor">10.1371/journal.pcbi.0030167.eor</ext-link>
).</p>
</fn>
<fn id="ack1" fn-type="con">
<p>
<bold>Author contributions.</bold>
ER and DH conceived and designed the computational work. VK and ZS performed the computational work. VK, YM, and DH analyzed the data. ZS, BS, and UW contributed analysis tools. VK, YM, ER, and DH wrote the paper.</p>
</fn>
<fn id="ack2" fn-type="financial-disclosure">
<p>
<bold>Funding.</bold>
This research was partially supported by the US–Israel Binational Science Foundation.</p>
</fn>
</fn-group>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>EC</term>
<def>
<p>Enzyme Commission</p>
</def>
</def-item>
<def-item>
<term>MSA</term>
<def>
<p>multiple sequence alignment</p>
</def>
</def-item>
<def-item>
<term>SP</term>
<def>
<p>specific peptides</p>
</def>
</def-item>
<def-item>
<term>SVM</term>
<def>
<p>support vector machine</p>
</def>
</def-item>
</def-list>
</glossary>
<ref-list>
<title>References</title>
<ref id="pcbi-0030167-b001">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Domingues</surname>
<given-names>FS</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Protein function from sequence and structure data</article-title>
<source>Appl Bioinformatics</source>
<volume>2</volume>
<fpage>3</fpage>
<lpage>12</lpage>
<pub-id pub-id-type="pmid">15130830</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b002">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rost</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Yachdav</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>The PredictProtein server</article-title>
<source>Nucleic Acids Res</source>
<volume>32</volume>
<fpage>W321</fpage>
<lpage>W326</lpage>
<pub-id pub-id-type="pmid">15215403</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b003">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tian</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Skolnick</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>How well is enzyme function conserved as a function of pairwise sequence identity?</article-title>
<source>J Mol Biol</source>
<volume>333</volume>
<fpage>863</fpage>
<lpage>882</lpage>
<pub-id pub-id-type="pmid">14568541</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b004">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hegyi</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>The relationship between protein structure and function: A comprehensive survey with application to the yeast genome</article-title>
<source>J Mol Biol</source>
<volume>288</volume>
<fpage>147</fpage>
<lpage>164</lpage>
<pub-id pub-id-type="pmid">10329133</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b005">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rost</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2002</year>
<article-title>Enzyme function less conserved than anticipated</article-title>
<source>J Mol Biol</source>
<volume>318</volume>
<fpage>595</fpage>
<lpage>608</lpage>
<pub-id pub-id-type="pmid">12051862</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b006">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>von Grotthuss</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Plewczynski</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ginalski</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Rychlewski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Shakhnovich</surname>
<given-names>EI</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>PDB-UF: Database of predicted enzymatic functions for unannotated protein structures from structural genomics</article-title>
<source>BMC Bioinformatics</source>
<volume>7</volume>
<fpage>53</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="pmid">16460560</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b007">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<year>1996</year>
<article-title>Protein sequence motifs</article-title>
<source>Curr Opin Struct Biol</source>
<volume>6</volume>
<fpage>366</fpage>
<lpage>376</lpage>
</element-citation>
</ref>
<ref id="pcbi-0030167-b008">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bairoch</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bucher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hofmann</surname>
<given-names>K</given-names>
</name>
</person-group>
<year>1997</year>
<article-title>Prosite</article-title>
<source>Nucleic Acids Res</source>
<volume>25</volume>
<fpage>217</fpage>
<lpage>221</lpage>
<pub-id pub-id-type="pmid">9016539</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b009">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aitken</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Protein consensus sequence motifs</article-title>
<source>Mol Biotechnol</source>
<volume>12</volume>
<fpage>241</fpage>
<lpage>253</lpage>
<pub-id pub-id-type="pmid">10631681</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b010">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Neville-Manning</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
</person-group>
<year>1998</year>
<article-title>Highly specific protein sequence motifs for genome analysis</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>95</volume>
<fpage>5865</fpage>
<lpage>5871</lpage>
<pub-id pub-id-type="pmid">9600885</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b011">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>JY</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>DL</given-names>
</name>
</person-group>
<year>2001</year>
<article-title>The eMOTIF database</article-title>
<source>Nucleic Acids Res</source>
<volume>29</volume>
<fpage>202</fpage>
<lpage>204</lpage>
<pub-id pub-id-type="pmid">11125091</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b012">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Falquet</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Pagni</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bucher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hulo</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sigrist</surname>
<given-names>CJ</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>The ProSite database, its status in 2002</article-title>
<source>Nucleic Acids Res</source>
<volume>30</volume>
<fpage>235</fpage>
<lpage>238</lpage>
<pub-id pub-id-type="pmid">11752303</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b013">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tong</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Drees</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Nardelli</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bader</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Brannetti</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<year>2002</year>
<article-title>A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules</article-title>
<source>Science</source>
<volume>295</volume>
<fpage>321</fpage>
<lpage>324</lpage>
<pub-id pub-id-type="pmid">11743162</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b014">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Obenauer</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Yaffe</surname>
<given-names>MB</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Computational prediction of protein–protein interactions</article-title>
<source>Methods Mol Biol</source>
<volume>261</volume>
<fpage>445</fpage>
<lpage>468</lpage>
<pub-id pub-id-type="pmid">15064475</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b015">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Solan</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Horn</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ruppin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Edelman</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Unsupervised learning of natural languages</article-title>
<source>Proc Natl Acad Sci U S A</source>
<volume>102</volume>
<fpage>11629</fpage>
<lpage>11634</lpage>
<pub-id pub-id-type="pmid">16087885</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b016">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ben-Hur</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Protein sequence motifs: Highly predictive features of protein function</article-title>
<person-group person-group-type="editor">
<name>
<surname>Guyon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Gunn</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Nikravesh</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zadeh</surname>
<given-names>L</given-names>
</name>
</person-group>
<source>Feature extraction, foundations and applications</source>
<publisher-loc>Berlin</publisher-loc>
<publisher-name>Springer Verlag</publisher-name>
</element-citation>
</ref>
<ref id="pcbi-0030167-b017">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liao</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Combining pairwise sequence analysis and support vector machines for detecting remote protein evolutionary and structural relationships</article-title>
<source>J Comput Biol</source>
<volume>10</volume>
<fpage>857</fpage>
<lpage>868</lpage>
</element-citation>
</ref>
<ref id="pcbi-0030167-b018">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cai</surname>
<given-names>CZ</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>LY</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>ZL</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>YZ</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>SVM-PROT: Web-based support vector machine software for functional classification of a protein from its primary sequence</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>3692</fpage>
<lpage>3697</lpage>
<pub-id pub-id-type="pmid">12824396</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b019">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cai</surname>
<given-names>CZ</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>LY</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>ZL</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>YZ</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Enzyme family classification by support vector machines</article-title>
<source>Proteins</source>
<volume>55</volume>
<fpage>66</fpage>
<lpage>76</lpage>
<pub-id pub-id-type="pmid">14997540</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b020">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schaffer</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<year>1997</year>
<article-title>Gapped BLAST and PSI-BLAST: A new generation of protein database search programs</article-title>
<source>Nucleic Acids Res</source>
<volume>25</volume>
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="pmid">9254694</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b021">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ben-Hur</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Brutlag</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Remote homology detection: A motif based approach</article-title>
<source>Bioinformatics</source>
<volume>19</volume>
<issue>(Supplement 1)</issue>
<fpage>i26</fpage>
<lpage>i33</lpage>
<pub-id pub-id-type="pmid">12855434</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b022">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Foster</surname>
<given-names>PG</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Santi</surname>
<given-names>DV</given-names>
</name>
<name>
<surname>Stroud</surname>
<given-names>RM</given-names>
</name>
</person-group>
<year>2000</year>
<article-title>The structural basis for tRNA recognition and pseudouridine formation by pseudouridine synthase I</article-title>
<source>Nat Struct Biol</source>
<volume>7</volume>
<fpage>23</fpage>
<lpage>27</lpage>
<pub-id pub-id-type="pmid">10625422</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b023">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Anda</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gebbia</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Backenson</surname>
<given-names>PB</given-names>
</name>
<name>
<surname>Coleman</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Benach</surname>
<given-names>JL</given-names>
</name>
</person-group>
<year>1996</year>
<article-title>A glyceraldehyde-3-phosphate dehydrogenase homolog in
<named-content content-type="genus-species">Borrelia burgdorferi</named-content>
and
<named-content content-type="genus-species">Borrelia hermsii</named-content>
</article-title>
<source>Infect Immun</source>
<volume>64</volume>
<fpage>262</fpage>
<lpage>268</lpage>
<pub-id pub-id-type="pmid">8557349</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b024">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hanks</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Quinn</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Hunter</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>1988</year>
<article-title>The protein kinase family: Conserved features and deduced phylogeny of the catalytic domains</article-title>
<source>Science</source>
<volume>241</volume>
<fpage>42</fpage>
<lpage>52</lpage>
<pub-id pub-id-type="pmid">3291115</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b025">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Walker</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Saraste</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Runswick</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Gay</surname>
<given-names>NJ</given-names>
</name>
</person-group>
<year>1982</year>
<article-title>Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold</article-title>
<source>EMBO J</source>
<volume>1</volume>
<fpage>945</fpage>
<lpage>951</lpage>
<pub-id pub-id-type="pmid">6329717</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b026">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Binkowski</surname>
<given-names>TA</given-names>
</name>
<name>
<surname>Naghibzadeh</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>CASTp: Computed atlas of surface topography of proteins</article-title>
<source>Nucleic Acids Res</source>
<volume>31</volume>
<fpage>3352</fpage>
<lpage>3355</lpage>
<pub-id pub-id-type="pmid">12824325</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b027">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benjamini</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hochberg</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>1995</year>
<article-title>Controlling the false discovery rate: A practical and powerful approach to multiple testing</article-title>
<source>J Roy Stat Soc</source>
<volume>57</volume>
<fpage>289</fpage>
<lpage>300</lpage>
</element-citation>
</ref>
<ref id="pcbi-0030167-b028">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ogiwara</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Uchiyama</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Seto</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kanehisa</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>1992</year>
<article-title>Construction of a dictionary of sequence motifs that characterize groups of related proteins</article-title>
<source>Protein Eng</source>
<volume>5</volume>
<fpage>479</fpage>
<lpage>488</lpage>
<pub-id pub-id-type="pmid">1438158</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b029">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>JTL</given-names>
</name>
<name>
<surname>Marr</surname>
<given-names>TG</given-names>
</name>
<name>
<surname>Shasha</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Shapiro</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Chirn</surname>
<given-names>GW</given-names>
</name>
</person-group>
<year>1994</year>
<article-title>Discovering active motifs in sets of related protein sequences and using them for classification</article-title>
<source>Nucleic Acids Res</source>
<volume>22</volume>
<fpage>2769</fpage>
<lpage>2775</lpage>
</element-citation>
</ref>
<ref id="pcbi-0030167-b030">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rigoutsos</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Floratos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Parida</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins</article-title>
<source>Proteins</source>
<volume>37</volume>
<fpage>264</fpage>
<lpage>277</lpage>
<pub-id pub-id-type="pmid">10584071</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b031">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Martin</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Berriman</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Barton</surname>
<given-names>GJ</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>GOtcha: A new method for prediction of protein function assessed by the annotation of seven genomes</article-title>
<source>BMC Bioinformatics</source>
<volume>5</volume>
<fpage>178</fpage>
<pub-id pub-id-type="pmid">15550167</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b032">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hawkins</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Luban</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kihara</surname>
<given-names>D</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Enhanced automated function prediction using distantly related sequences and contextual association by PFP</article-title>
<source>Protein Sci</source>
<volume>15</volume>
<fpage>1550</fpage>
<lpage>1556</lpage>
<pub-id pub-id-type="pmid">16672240</pub-id>
</element-citation>
</ref>
<ref id="pcbi-0030167-b033">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Smith</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>1981</year>
<article-title>Identification of common molecular subsequences</article-title>
<source>J Mol Biol</source>
<volume>147</volume>
<fpage>195</fpage>
<lpage>197</lpage>
<pub-id pub-id-type="pmid">7265238</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F82 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000F82 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:1950953
   |texte=   Functional Representation of Enzymes by Specific Peptides
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:17722976" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021