The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins</title>
<author><name sortKey="Verma, Ruchi" sort="Verma, Ruchi" uniqKey="Verma R" first="Ruchi" last="Verma">Ruchi Verma</name>
<affiliation><nlm:aff id="I1">Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078 USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Melcher, Ulrich" sort="Melcher, Ulrich" uniqKey="Melcher U" first="Ulrich" last="Melcher">Ulrich Melcher</name>
<affiliation><nlm:aff id="I1">Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078 USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">23046503</idno>
<idno type="pmc">3439722</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439722</idno>
<idno type="RBID">PMC:3439722</idno>
<idno type="doi">10.1186/1471-2105-13-S15-S9</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000A87</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins</title>
<author><name sortKey="Verma, Ruchi" sort="Verma, Ruchi" uniqKey="Verma R" first="Ruchi" last="Verma">Ruchi Verma</name>
<affiliation><nlm:aff id="I1">Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078 USA</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Melcher, Ulrich" sort="Melcher, Ulrich" uniqKey="Melcher U" first="Ulrich" last="Melcher">Ulrich Melcher</name>
<affiliation><nlm:aff id="I1">Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078 USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Members of the phylum Proteobacteria are most prominent among bacteria causing plant diseases that result in a diminution of the quantity and quality of food produced by agriculture. To ameliorate these losses, there is a need to identify infections in early stages. Recent developments in next generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem but none have emerged as the best protocol. Here we attempt a systematic way to determine organismal origins of peptides by using a machine learning algorithm. The algorithm that we implement is a Support Vector Machine (SVM).</p>
</sec>
<sec><title>Result</title>
<p>The amino acid compositions of proteobacterial proteins were found to be different from those of plant proteins. We developed an SVM model based on amino acid and dipeptide compositions to distinguish between a proteobacterial protein and a plant protein. The amino acid composition (AAC) based SVM model had an accuracy of 92.44% with a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC) based SVM model had a maximum accuracy of 94.67% and an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% and an MCC of 0.90. The models were tested on unseen datasets that were not used for training to assess their validity.</p>
</sec>
<sec><title>Conclusion</title>
<p>The results indicate that the SVM based on the AAC and DC hybrid approach can be used to distinguish proteobacterial from plant protein sequences.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Strange, Rn" uniqKey="Strange R">RN Strange</name>
</author>
<author><name sortKey="Scott, Pr" uniqKey="Scott P">PR Scott</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Emerson, D" uniqKey="Emerson D">D Emerson</name>
</author>
<author><name sortKey="Rentz, Ja" uniqKey="Rentz J">JA Rentz</name>
</author>
<author><name sortKey="Lilburn, Tg" uniqKey="Lilburn T">TG Lilburn</name>
</author>
<author><name sortKey="Davis, Re" uniqKey="Davis R">RE Davis</name>
</author>
<author><name sortKey="Aldrich, H" uniqKey="Aldrich H">H Aldrich</name>
</author>
<author><name sortKey="Chan, C" uniqKey="Chan C">C Chan</name>
</author>
<author><name sortKey="Moyer, Cl" uniqKey="Moyer C">CL Moyer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Melcher, U" uniqKey="Melcher U">U Melcher</name>
</author>
<author><name sortKey="Grover, V" uniqKey="Grover V">V Grover</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Fletcher, J" uniqKey="Fletcher J">J Fletcher</name>
</author>
<author><name sortKey="Bender, C" uniqKey="Bender C">C Bender</name>
</author>
<author><name sortKey="Budowle, B" uniqKey="Budowle B">B Budowle</name>
</author>
<author><name sortKey="Cobb, Wt" uniqKey="Cobb W">WT Cobb</name>
</author>
<author><name sortKey="Gold, Se" uniqKey="Gold S">SE Gold</name>
</author>
<author><name sortKey="Ishimaru, Ca" uniqKey="Ishimaru C">CA Ishimaru</name>
</author>
<author><name sortKey="Luster, D" uniqKey="Luster D">D Luster</name>
</author>
<author><name sortKey="Melcher, U" uniqKey="Melcher U">U Melcher</name>
</author>
<author><name sortKey="Murch, R" uniqKey="Murch R">R Murch</name>
</author>
<author><name sortKey="Scherm, H" uniqKey="Scherm H">H Scherm</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author><name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author><name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author><name sortKey="Myers, Ew" uniqKey="Myers E">EW Myers</name>
</author>
<author><name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Verma, R" uniqKey="Verma R">R Verma</name>
</author>
<author><name sortKey="Tiwari, A" uniqKey="Tiwari A">A Tiwari</name>
</author>
<author><name sortKey="Kaur, S" uniqKey="Kaur S">S Kaur</name>
</author>
<author><name sortKey="Varshney, Gc" uniqKey="Varshney G">GC Varshney</name>
</author>
<author><name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kaundal, R" uniqKey="Kaundal R">R Kaundal</name>
</author>
<author><name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
<author><name sortKey="Wong, Kk" uniqKey="Wong K">KK Wong</name>
</author>
<author><name sortKey="Young, Gs" uniqKey="Young G">GS Young</name>
</author>
<author><name sortKey="Guo, L" uniqKey="Guo L">L Guo</name>
</author>
<author><name sortKey="Wong, St" uniqKey="Wong S">ST Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Choi, S" uniqKey="Choi S">S Choi</name>
</author>
<author><name sortKey="Jiang, Z" uniqKey="Jiang Z">Z Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Magnin, B" uniqKey="Magnin B">B Magnin</name>
</author>
<author><name sortKey="Mesrob, L" uniqKey="Mesrob L">L Mesrob</name>
</author>
<author><name sortKey="Kinkingnehun, S" uniqKey="Kinkingnehun S">S Kinkingnehun</name>
</author>
<author><name sortKey="Pelegrini Issac, M" uniqKey="Pelegrini Issac M">M Pelegrini-Issac</name>
</author>
<author><name sortKey="Colliot, O" uniqKey="Colliot O">O Colliot</name>
</author>
<author><name sortKey="Sarazin, M" uniqKey="Sarazin M">M Sarazin</name>
</author>
<author><name sortKey="Dubois, B" uniqKey="Dubois B">B Dubois</name>
</author>
<author><name sortKey="Lehericy, S" uniqKey="Lehericy S">S Lehericy</name>
</author>
<author><name sortKey="Benali, H" uniqKey="Benali H">H Benali</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Vert, Jp" uniqKey="Vert J">JP Vert</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Furey, Ts" uniqKey="Furey T">TS Furey</name>
</author>
<author><name sortKey="Cristianini, N" uniqKey="Cristianini N">N Cristianini</name>
</author>
<author><name sortKey="Duffy, N" uniqKey="Duffy N">N Duffy</name>
</author>
<author><name sortKey="Bednarski, Dw" uniqKey="Bednarski D">DW Bednarski</name>
</author>
<author><name sortKey="Schummer, M" uniqKey="Schummer M">M Schummer</name>
</author>
<author><name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dharmasaroja, P" uniqKey="Dharmasaroja P">P Dharmasaroja</name>
</author>
<author><name sortKey="Dharmasaroja, Pa" uniqKey="Dharmasaroja P">PA Dharmasaroja</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Naguib, Ia" uniqKey="Naguib I">IA Naguib</name>
</author>
<author><name sortKey="Darwish, Hw" uniqKey="Darwish H">HW Darwish</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dondoshansky, Iwy" uniqKey="Dondoshansky I">IWY Dondoshansky</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Joachims, T" uniqKey="Joachims T">T Joachims</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="O Dwyer, L" uniqKey="O Dwyer L">L O'Dwyer</name>
</author>
<author><name sortKey="Lamberton, F" uniqKey="Lamberton F">F Lamberton</name>
</author>
<author><name sortKey="Bokde, Al" uniqKey="Bokde A">AL Bokde</name>
</author>
<author><name sortKey="Ewers, M" uniqKey="Ewers M">M Ewers</name>
</author>
<author><name sortKey="Faluyi, Yo" uniqKey="Faluyi Y">YO Faluyi</name>
</author>
<author><name sortKey="Tanner, C" uniqKey="Tanner C">C Tanner</name>
</author>
<author><name sortKey="Mazoyer, B" uniqKey="Mazoyer B">B Mazoyer</name>
</author>
<author><name sortKey="O Neill, D" uniqKey="O Neill D">D O'Neill</name>
</author>
<author><name sortKey="Bartley, M" uniqKey="Bartley M">M Bartley</name>
</author>
<author><name sortKey="Collins, Dr" uniqKey="Collins D">DR Collins</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ansari, Hr" uniqKey="Ansari H">HR Ansari</name>
</author>
<author><name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Baldi, P" uniqKey="Baldi P">P Baldi</name>
</author>
<author><name sortKey="Brunak, S" uniqKey="Brunak S">S Brunak</name>
</author>
<author><name sortKey="Chauvin, Y" uniqKey="Chauvin Y">Y Chauvin</name>
</author>
<author><name sortKey="Andersen, Ca" uniqKey="Andersen C">CA Andersen</name>
</author>
<author><name sortKey="Nielsen, H" uniqKey="Nielsen H">H Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Verma, R" uniqKey="Verma R">R Verma</name>
</author>
<author><name sortKey="Varshney, Gc" uniqKey="Varshney G">GC Varshney</name>
</author>
<author><name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lu, Q" uniqKey="Lu Q">Q Lu</name>
</author>
<author><name sortKey="Cui, Y" uniqKey="Cui Y">Y Cui</name>
</author>
<author><name sortKey="Ye, C" uniqKey="Ye C">C Ye</name>
</author>
<author><name sortKey="Wei, C" uniqKey="Wei C">C Wei</name>
</author>
<author><name sortKey="Elston, Rc" uniqKey="Elston R">RC Elston</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="He, X" uniqKey="He X">X He</name>
</author>
<author><name sortKey="Frey, E" uniqKey="Frey E">E Frey</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chappell, Fm" uniqKey="Chappell F">FM Chappell</name>
</author>
<author><name sortKey="Raab, Gm" uniqKey="Raab G">GM Raab</name>
</author>
<author><name sortKey="Wardlaw, Jm" uniqKey="Wardlaw J">JM Wardlaw</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Algarabel, S" uniqKey="Algarabel S">S Algarabel</name>
</author>
<author><name sortKey="Pitarque, A" uniqKey="Pitarque A">A Pitarque</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Higashida, Y" uniqKey="Higashida Y">Y Higashida</name>
</author>
<author><name sortKey="Ideguchi, T" uniqKey="Ideguchi T">T Ideguchi</name>
</author>
<author><name sortKey="Muranaka, T" uniqKey="Muranaka T">T Muranaka</name>
</author>
<author><name sortKey="Tabata, N" uniqKey="Tabata N">N Tabata</name>
</author>
<author><name sortKey="Miyajima, R" uniqKey="Miyajima R">R Miyajima</name>
</author>
<author><name sortKey="Akazawa, F" uniqKey="Akazawa F">F Akazawa</name>
</author>
<author><name sortKey="Ikeda, H" uniqKey="Ikeda H">H Ikeda</name>
</author>
<author><name sortKey="Morimoto, K" uniqKey="Morimoto K">K Morimoto</name>
</author>
<author><name sortKey="Ohki, M" uniqKey="Ohki M">M Ohki</name>
</author>
<author><name sortKey="Toyofuku, F" uniqKey="Toyofuku F">F Toyofuku</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wiebringhaus, R" uniqKey="Wiebringhaus R">R Wiebringhaus</name>
</author>
<author><name sortKey="John, V" uniqKey="John V">V John</name>
</author>
<author><name sortKey="Muller, Rd" uniqKey="Muller R">RD Muller</name>
</author>
<author><name sortKey="Hirche, H" uniqKey="Hirche H">H Hirche</name>
</author>
<author><name sortKey="Voss, M" uniqKey="Voss M">M Voss</name>
</author>
<author><name sortKey="Callies, R" uniqKey="Callies R">R Callies</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Daures, Jp" uniqKey="Daures J">JP Daures</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hannequin, P" uniqKey="Hannequin P">P Hannequin</name>
</author>
<author><name sortKey="Liehn, Jc" uniqKey="Liehn J">JC Liehn</name>
</author>
<author><name sortKey="Delisle, Mj" uniqKey="Delisle M">MJ Delisle</name>
</author>
<author><name sortKey="Deltour, G" uniqKey="Deltour G">G Deltour</name>
</author>
<author><name sortKey="Valeyre, J" uniqKey="Valeyre J">J Valeyre</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Creelman, Cd" uniqKey="Creelman C">CD Creelman</name>
</author>
<author><name sortKey="Donaldson, W" uniqKey="Donaldson W">W Donaldson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Balakrishnan, N" uniqKey="Balakrishnan N">N Balakrishnan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zahr, N" uniqKey="Zahr N">N Zahr</name>
</author>
<author><name sortKey="Arnaud, L" uniqKey="Arnaud L">L Arnaud</name>
</author>
<author><name sortKey="Marquet, P" uniqKey="Marquet P">P Marquet</name>
</author>
<author><name sortKey="Haroche, J" uniqKey="Haroche J">J Haroche</name>
</author>
<author><name sortKey="Costedoat Chalumeau, N" uniqKey="Costedoat Chalumeau N">N Costedoat-Chalumeau</name>
</author>
<author><name sortKey="Hulot, Js" uniqKey="Hulot J">JS Hulot</name>
</author>
<author><name sortKey="Funck Brentano, C" uniqKey="Funck Brentano C">C Funck-Brentano</name>
</author>
<author><name sortKey="Piette, Jc" uniqKey="Piette J">JC Piette</name>
</author>
<author><name sortKey="Amoura, Z" uniqKey="Amoura Z">Z Amoura</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">23046503</article-id>
<article-id pub-id-type="pmc">3439722</article-id>
<article-id pub-id-type="publisher-id">1471-2105-13-S15-S9</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-13-S15-S9</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group><article-title>A Support Vector Machine based method to distinguish proteobacterial proteins from eukaryotic plant proteins</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" id="A1"><name><surname>Verma</surname>
<given-names>Ruchi</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>ruchi.verma@okstate.edu</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A2"><name><surname>Melcher</surname>
<given-names>Ulrich</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>ulrich.melcher@okstate.edu</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK 74078 USA</aff>
<pub-date pub-type="collection"><year>2012</year>
</pub-date>
<pub-date pub-type="epub"><day>11</day>
<month>9</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<issue>Suppl 15</issue>
<supplement><named-content content-type="supplement-title">Ninth Annual MCBIOS Conference. Dealing with the Omics Data Deluge</named-content>
<named-content content-type="supplement-editor">Jonathan Wren, Susan Bridges, Doris Kupfer, Dennis Burian, Mikhail Dozmorov and Rakesh Kaundal</named-content>
</supplement>
<fpage>S9</fpage>
<lpage>S9</lpage>
<permissions><copyright-statement>Copyright ©2012 Verma and Melcher; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Verma and Melcher; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/13/S15/S9"></self-uri>
<abstract><sec><title>Background</title>
<p>Members of the phylum Proteobacteria are most prominent among bacteria causing plant diseases that result in a diminution of the quantity and quality of food produced by agriculture. To ameliorate these losses, there is a need to identify infections in early stages. Recent developments in next generation nucleic acid sequencing and mass spectrometry open the door to screening plants by the sequences of their macromolecules. Such an approach requires the ability to recognize the organismal origin of unknown DNA or peptide fragments. There are many ways to approach this problem but none have emerged as the best protocol. Here we attempt a systematic way to determine organismal origins of peptides by using a machine learning algorithm. The algorithm that we implement is a Support Vector Machine (SVM).</p>
</sec>
<sec><title>Result</title>
<p>The amino acid compositions of proteobacterial proteins were found to be different from those of plant proteins. We developed an SVM model based on amino acid and dipeptide compositions to distinguish between a proteobacterial protein and a plant protein. The amino acid composition (AAC) based SVM model had an accuracy of 92.44% with a Matthews correlation coefficient (MCC) of 0.85, while the dipeptide composition (DC) based SVM model had a maximum accuracy of 94.67% and an MCC of 0.89. We also developed SVM models based on a hybrid approach (AAC and DC), which gave a maximum accuracy of 94.86% and an MCC of 0.90. The models were tested on unseen datasets that were not used for training to assess their validity.</p>
</sec>
<sec><title>Conclusion</title>
<p>The results indicate that the SVM based on the AAC and DC hybrid approach can be used to distinguish proteobacterial from plant protein sequences.</p>
</sec>
</abstract>
<kwd-group><kwd>proteobacteria</kwd>
<kwd>plant proteins</kwd>
<kwd>SVM</kwd>
<kwd>machine learning</kwd>
<kwd>amino acid composition</kwd>
<kwd>dipeptide composition</kwd>
</kwd-group>
<conference><conf-date>17-18 February 2012</conf-date>
<conf-name>Proceedings of the Ninth Annual MCBIOS Conference. Dealing with the Omics Data Deluge</conf-name>
<conf-loc>Oxford, MS, USA</conf-loc>
</conference>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>Bacterial plant pathogens are a major threat to global food security [<xref ref-type="bibr" rid="B1">1</xref>
]. Half of the bacterial species causing major food losses in the world belong to the major phylum Proteobacteria (Figure <xref ref-type="fig" rid="F1">1</xref>
). They are found predominantly in the class Gammaproteobacteria (<italic>Xanthomonas, Pseudomonas </italic>
and <italic>Erwinia</italic>
) and also in the class Betaproteobacteria (<italic>Ralstonia</italic>
). Gammaproteobacteria also include a wide variety of medically, ecologically and scientifically important groups such as Enterobacteriaceae (<italic>Escherichia coli</italic>
), Vibrionaceae and Pseudomonadaceae. Beneficial bacteria, such as nitrogen-fixing, ammonia-oxidizing and iron-fixing bacteria, are also members of this phylum. Betaproteobacteria also include ammonia-oxidizing and arsenic-resistant bacteria, with Burkholderiales as one of the major orders. Alphaproteobacteria are dominated mostly by nitrogen-fixing bacteria and agrobacteria. Deltaproteobacteria and Epsilonproteobacteria have aerobic genera and curved to spirilloid <italic>Wolinella </italic>
spp., respectively. Zetaproteobacteria is composed of a sole member: <italic>Mariprofundus ferrooxydans </italic>
which oxidizes ferrous to ferric iron [<xref ref-type="bibr" rid="B2">2</xref>
].</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p><bold>The subgroups of proteobacteria and the main members of each subgroup</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S15-S9-1"></graphic>
</fig>
<p>Several methods that rely on macromolecular sequencing, especially nucleotide sequencing, are being developed to detect phytopathogens [<xref ref-type="bibr" rid="B3">3</xref>
,<xref ref-type="bibr" rid="B4">4</xref>
]. With the advent of next generation sequencing, testing of diseased or quarantined plants for the presence of proteobacteria will rely increasingly on massive DNA sequencing. Peptide mass spectrometry also shows promise in such screening. The analysis of nucleotide sequences typically involves assembly of sequence reads into contigs followed by a BLAST [<xref ref-type="bibr" rid="B5">5</xref>
] search to identify pathogen-derived contigs. This approach is limited in that it only identifies potential pathogens whose nucleotide sequences are included in the searched database. Thus, there is a strong need for methods to find the organismal origin of unknown DNA or peptide fragments to identify potential pathogen sequences. Machine learning techniques, such as support vector machines (SVMs) and neural networks, have been used successfully to develop classifiers for a number of different biological problems, including predicting different categories of proteins [<xref ref-type="bibr" rid="B6">6</xref>
-<xref ref-type="bibr" rid="B14">14</xref>
]. As a first step towards detecting pathogenic bacterial species, we evaluated whether a machine learning algorithm, SVM, could distinguish between proteobacterial (potential pathogen) and plant (host) proteins. Thus, we assembled datasets of proteobacterial and plant host proteins for this study. We focused on amino acid, rather than nucleotide, residues because of the greater variety of residues that can be present at any one position, allowing subtle evolutionary forces to play a role in shaping the protein sequence and its properties.</p>
</sec>
<sec sec-type="methods"><title>Methods</title>
<sec><title>Training datasets</title>
<p>Amino acid sequences of proteobacteria and plants were downloaded from the UniProt website (UniProt release 2012_01, Jan 25, 2012) <ext-link ext-link-type="uri" xlink:href="http://www.uniprot.org/">http://www.uniprot.org/</ext-link>
. Only reviewed protein sequences were taken into consideration. A total of 3508 proteins (mean length, 322 ± 202) from nine species of proteobacteria (of which, three are phytopathogens) and 3206 proteins (mean length, 376 ± 308) from ten plant species were used initially for training. We used Blastclust [<xref ref-type="bibr" rid="B15">15</xref>
] to remove redundant proteins from the data, defined as those sharing more than a specified percent identity (the % redundancy value). Redundancy filtering was performed both before and after combining proteins from different species. Datasets were constructed at 90%, 50% and 30% redundancy values. The 90% redundancy set contained 3408 proteobacterial and 2631 plant host proteins, the 50% set contained 3230 proteobacterial and 2284 plant host proteins, and the 30% set contained 3203 proteobacterial and 2277 plant host proteins. As the goal of this study was to identify bacterial proteins, the proteobacterial protein set was taken as the positive class and the plant protein set as the negative class (Tables <xref ref-type="table" rid="T1">1</xref>
 and <xref ref-type="table" rid="T2">2</xref>
). Training and test sets were constructed by five-fold cross-validation to create a model for the classification of new sequences (Figure <xref ref-type="fig" rid="F2">2</xref>
). Thus each subset was used in both training and testing. To further validate the performance of our best-trained models, we tested them on unseen (blind) data not used for training the SVM. From UniProt we downloaded non-redundant proteins for three species of proteobacteria (<italic>Serratia marcescens, Acidovorax citrulli, Rhizobium fredii</italic>
) and three plant species (<italic>Solanum lycopersicum, Phaseolus vulgaris, Cucurbita pepo</italic>
).</p>
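<p>The five-fold partitioning described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name <monospace>five_fold_splits</monospace>, the per-class round-robin assignment to folds, the fixed random seed and the toy identifiers are all assumptions introduced here, since the text does not state how sequences were assigned to the five subsets.</p>
<preformat>
```python
import random


def five_fold_splits(pset, nset, k=5, seed=0):
    """Yield (train, test) lists of (sequence, label) pairs for k-fold CV.

    pset: positive class (proteobacterial proteins), labelled 1.
    nset: negative class (plant proteins), labelled 0.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pos = [(seq, 1) for seq in pset]
    neg = [(seq, 0) for seq in nset]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Assumption: deal each class round-robin into k folds so that every
    # fold keeps roughly the same positive/negative ratio.
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Hypothetical toy identifiers standing in for real protein sequences.
pset = ["P%d" % i for i in range(10)]
nset = ["N%d" % i for i in range(5)]
splits = list(five_fold_splits(pset, nset))
```
</preformat>
<p>Each of the five iterations holds out one fold for testing and trains on the remaining four, so every sequence appears in a test set exactly once.</p>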
<table-wrap id="T1" position="float"><label>Table 1</label>
<caption><p>Total number of pathogen proteins taken from UniProt and number of proteins remaining after redundancy filtering at three different redundancy values.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Positive dataset (Pathogen)</th>
<th align="center">Total number of proteins (reviewed)</th>
<th align="center">90% redundancy</th>
<th align="center">50% redundancy</th>
<th align="center">30% redundancy</th>
</tr>
</thead>
<tbody><tr><td align="center"><italic>Agrobacterium tumefaciens (Rhizobium radiobacter)</italic>
</td>
<td align="center">104</td>
<td align="center">103</td>
<td align="center">103</td>
<td align="center">103</td>
</tr>
<tr><td align="center"><italic>Burkholderia phymatum</italic>
</td>
<td align="center">333</td>
<td align="center">333</td>
<td align="center">333</td>
<td align="center">333</td>
</tr>
<tr><td align="center"><italic>Pseudomonas aeruginosa </italic>
(ATCC)</td>
<td align="center">1217</td>
<td align="center">1216</td>
<td align="center">1211</td>
<td align="center">1211</td>
</tr>
<tr><td align="center"><italic>Xanthomonas oryzae pv. oryzae</italic>
</td>
<td align="center">411</td>
<td align="center">410</td>
<td align="center">410</td>
<td align="center">410</td>
</tr>
<tr><td align="center"><italic>Ralstonia solanacearum</italic>
</td>
<td align="center">601</td>
<td align="center">601</td>
<td align="center">599</td>
<td align="center">599</td>
</tr>
<tr><td align="center"><italic>Rhizobium etli </italic>
(ATCC)</td>
<td align="center">424</td>
<td align="center">421</td>
<td align="center">421</td>
<td align="center">421</td>
</tr>
<tr><td align="center"><italic>Rhizobium meliloti</italic>
</td>
<td align="center">48</td>
<td align="center">47</td>
<td align="center">47</td>
<td align="center">47</td>
</tr>
<tr><td align="center"><italic>Methylobacterium nodulans</italic>
</td>
<td align="center">213</td>
<td align="center">213</td>
<td align="center">213</td>
<td align="center">213</td>
</tr>
<tr><td align="center"><italic>Desulfobacterales autotrophicum </italic>
(ATCC)</td>
<td align="center">157</td>
<td align="center">157</td>
<td align="center">157</td>
<td align="center">157</td>
</tr>
<tr><td align="center">Total</td>
<td align="center">3508</td>
<td align="center">3501</td>
<td align="center">3494</td>
<td align="center">3494</td>
</tr>
<tr><td align="center">Total after blastclust on cumulative data</td>
<td align="center">-</td>
<td align="center">3408</td>
<td align="center">3230</td>
<td align="center">3203</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" position="float"><label>Table 2</label>
<caption><p>Total number of plant proteins taken from UniProt and number of proteins remaining after redundancy filtering at three different redundancy values.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Negative dataset (Plant host)</th>
<th align="center">Total number of proteins (reviewed)</th>
<th align="center">90% redundancy</th>
<th align="center">50% redundancy</th>
<th align="center">30% redundancy</th>
</tr>
</thead>
<tbody><tr><td align="center"><italic>Triticum aestivum</italic>
</td>
<td align="center">357</td>
<td align="center">315</td>
<td align="center">292</td>
<td align="center">291</td>
</tr>
<tr><td align="center"><italic>Oryza sativa</italic>
</td>
<td align="center">87</td>
<td align="center">86</td>
<td align="center">86</td>
<td align="center">86</td>
</tr>
<tr><td align="center"><italic>Solanum tuberosum</italic>
</td>
<td align="center">390</td>
<td align="center">314</td>
<td align="center">308</td>
<td align="center">308</td>
</tr>
<tr><td align="center"><italic>Arabidopsis thaliana</italic>
</td>
<td align="center">1000</td>
<td align="center">968</td>
<td align="center">857</td>
<td align="center">852</td>
</tr>
<tr><td align="center"><italic>Cucurbita maxima</italic>
</td>
<td align="center">26</td>
<td align="center">25</td>
<td align="center">25</td>
<td align="center">25</td>
</tr>
<tr><td align="center"><italic>Citrus sinensis</italic>
</td>
<td align="center">93</td>
<td align="center">91</td>
<td align="center">91</td>
<td align="center">91</td>
</tr>
<tr><td align="center"><italic>Vitis vinifera</italic>
</td>
<td align="center">161</td>
<td align="center">154</td>
<td align="center">152</td>
<td align="center">152</td>
</tr>
<tr><td align="center"><italic>Hordeum vulgare</italic>
</td>
<td align="center">348</td>
<td align="center">323</td>
<td align="center">307</td>
<td align="center">307</td>
</tr>
<tr><td align="center"><italic>Pisum sativum</italic>
</td>
<td align="center">371</td>
<td align="center">347</td>
<td align="center">335</td>
<td align="center">334</td>
</tr>
<tr><td align="center"><italic>Glycine max</italic>
</td>
<td align="center">373</td>
<td align="center">346</td>
<td align="center">324</td>
<td align="center">324</td>
</tr>
<tr><td align="center">Total</td>
<td align="center">3206</td>
<td align="center">2969</td>
<td align="center">2777</td>
<td align="center">2770</td>
</tr>
<tr><td align="center">Total after blastclust on cumulative data</td>
<td align="center">-</td>
<td align="center">2631</td>
<td align="center">2284</td>
<td align="center">2277</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p><bold>Construction of datasets using five-fold cross validation</bold>
. Pset denotes the positive dataset (proteobacteria) and Nset the negative dataset (plants).</p>
</caption>
<graphic xlink:href="1471-2105-13-S15-S9-2"></graphic>
</fig>
</sec>
<sec><title>Feature Vectors used</title>
<p>Amino Acid Composition (AAC): Each protein was represented as a vector of 20 features, each corresponding to the fractional composition of one amino acid. These feature vectors were presented as input to the SVM. Amino acid frequencies were calculated separately for the two sets of proteins (proteobacteria and plants). The AAC was calculated by the following equation:</p>
<p><disp-formula><mml:math id="M1" name="1471-2105-13-S15-S9-i1" overflow="scroll"><mml:mrow><mml:mtext>Fraction</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mtext>of</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mtext>amino</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mtext>acid</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mtext>x</mml:mtext>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mtext>Total number of amino acid x</mml:mtext>
</mml:mrow>
<mml:mrow><mml:mtext>Total number of amino acids in protein</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where x can be any amino acid residue.</p>
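<p>The AAC equation above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name <italic>amino_acid_composition</italic>, the ordering of the 20 residues and the choice to ignore non-standard residue codes are assumptions made only for this example.</p>
<preformat preformat-type="code">
```python
from collections import Counter

# The 20 standard amino acids in one-letter code (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 fractional amino acid compositions of a protein sequence."""
    counts = Counter(sequence.upper())
    # Assumption: non-standard residue codes (e.g. X, B, Z) are ignored, so the
    # denominator counts only the 20 standard residues.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical nine-residue sequence, used only to show the arithmetic.
aac = amino_acid_composition("MKVLAAGIA")
print(len(aac))            # 20
print(round(sum(aac), 6))  # 1.0
```
</preformat>
<p>In this hypothetical sequence alanine occurs three times among nine residues, so its feature value is 3/9.</p>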
<p>Dipeptide Composition (DC): Each protein was represented as a vector of 400 features for the 20 × 20 possible combinations of amino acids. The DC was calculated by the following equation:</p>
<p><disp-formula><mml:math id="M2" name="1471-2105-13-S15-S9-i2" overflow="scroll"><mml:mrow><mml:mtext>Fraction</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mtext>of</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mtext>dipeptide</mml:mtext>
<mml:mfenced close=")" open="("><mml:mrow><mml:mtext>xy</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mtext>Total number of dipeptide xy</mml:mtext>
</mml:mrow>
<mml:mrow><mml:mtext>Total number of all possible dipeptides</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where dipeptide (xy) is one of 400 possible dipeptides.</p>
<p>Hybrid (AAC+DC): The AAC and DC feature vectors were merged to yield feature vectors of 420 features (20+400).</p>
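<p>A minimal Python sketch of the DC and hybrid feature vectors follows. It is an illustration under stated assumptions rather than the authors' implementation: the denominator of the DC equation is taken here to be the number of overlapping adjacent residue pairs in the sequence, pairs containing non-standard residues are skipped, and the function names are hypothetical.</p>
<preformat preformat-type="code">
```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return 400 dipeptide fractions over overlapping adjacent residue pairs."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


def hybrid_features(aac, dc):
    """Concatenate a 20-feature AAC vector and a 400-feature DC vector."""
    return list(aac) + list(dc)


# Hypothetical sequence; the AAC is computed inline to keep the sketch self-contained.
seq = "MKVLAAGIA"
dc = dipeptide_composition(seq)
aac = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
hybrid = hybrid_features(aac, dc)
print(len(dc), len(hybrid))  # 400 420
```
</preformat>
<p>The hypothetical sequence contains eight overlapping pairs, one of which is AA, so the AA feature equals 1/8.</p>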
</sec>
<sec><title>Support Vector Machine</title>
<p>An SVM is a kernel-based maximum-margin classifier that draws on both statistics and optimization. It constructs an optimal hyperplane in a high-dimensional feature space; this boundary maximizes the margin between the data samples of the two classes, which improves generalization (Figure <xref ref-type="fig" rid="F3">3</xref>
). Specifically, SVM<sup>light</sup>
, an implementation of SVM in the C language, was used in this study. The SVM<sup>light</sup>
package can be downloaded from <ext-link ext-link-type="uri" xlink:href="http://www.joachims.org">http://www.joachims.org</ext-link>
 for non-commercial or academic use [<xref ref-type="bibr" rid="B16">16</xref>
]. In this study we used the SVM concept for the classification of proteobacteria and plant (host) proteins. Learning was carried out by using three kinds of kernels: the linear (t = 0), the polynomial (t = 1) and the Radial Basis Function (RBF) (t = 2). We obtained the best performance from the RBF.</p>
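<p>As a hedged illustration of how feature vectors might be passed to SVM<sup>light</sup>, the Python sketch below writes one example in the SVM<sup>light</sup> input format (a target value followed by index:value pairs with indices starting at 1) and assembles an svm_learn command line for the RBF kernel. The helper names and the file names train.dat and model.dat are hypothetical; the values g = 0.01, c = 8 and j = 1 are those reported for the hybrid model at 90% redundancy, and labelling proteobacterial proteins as +1 follows the positive dataset defined above.</p>
<preformat preformat-type="code">
```python
def to_svmlight_line(label, features):
    """Format one example as 'target index:value ...' with 1-based indices."""
    pairs = " ".join(
        "{}:{:.6f}".format(index, value)
        for index, value in enumerate(features, start=1)
    )
    return "{:+d} {}".format(label, pairs)


def svm_learn_command(train_file, model_file, gamma, c, j):
    """Assemble an svm_learn call that uses the RBF kernel (t = 2)."""
    return [
        "svm_learn",
        "-t", "2",         # kernel type: 0 linear, 1 polynomial, 2 RBF
        "-g", str(gamma),  # gamma parameter of the RBF kernel
        "-c", str(c),      # trade-off between training error and margin
        "-j", str(j),      # cost factor for errors on positive examples
        train_file,
        model_file,
    ]


# +1 marks a proteobacterial (positive) example; the three values are made up.
print(to_svmlight_line(+1, [0.25, 0.0, 0.75]))
# +1 1:0.250000 2:0.000000 3:0.750000
print(" ".join(svm_learn_command("train.dat", "model.dat", 0.01, 8, 1)))
# svm_learn -t 2 -g 0.01 -c 8 -j 1 train.dat model.dat
```
</preformat>
<p>The command is returned as a list so that it could be handed to a process launcher without shell quoting; running it requires a local SVM<sup>light</sup> installation, which this sketch does not assume.</p>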
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p><bold>The concept of Support Vector Machine (SVM) in feature differentiation</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S15-S9-3"></graphic>
</fig>
</sec>
<sec><title>Evaluation</title>
<p>Evaluation of the performance of the three models is threshold dependent. The performance of our method was computed by using the following standard parameters [<xref ref-type="bibr" rid="B17">17</xref>
,<xref ref-type="bibr" rid="B18">18</xref>
].</p>
<p>(a) Sensitivity or coverage of positive examples: percent of proteobacterial proteins correctly predicted</p>
<p><disp-formula><mml:math id="M3" name="1471-2105-13-S15-S9-i3" overflow="scroll"><mml:mrow><mml:mtext>Sensitivity</mml:mtext>
<mml:mfenced close=")" open="("><mml:mrow><mml:mtext>Sn</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mtext>TP</mml:mtext>
</mml:mrow>
<mml:mrow><mml:mtext>TP</mml:mtext>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mtext>FN</mml:mtext>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>(b) Specificity or coverage of negative examples: percent of plant proteins correctly predicted as plant protein</p>
<p><disp-formula><mml:math id="M4" name="1471-2105-13-S15-S9-i4" overflow="scroll"><mml:mrow><mml:mtext>Specificity</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mfenced close=")" open="("><mml:mrow><mml:mtext>Sp</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mtext>TN</mml:mtext>
</mml:mrow>
<mml:mrow><mml:mtext>TN</mml:mtext>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mtext>FP</mml:mtext>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>(c) Accuracy: percent of correctly predicted proteins (bacterial and plant proteins).</p>
<p><disp-formula><mml:math id="M5" name="1471-2105-13-S15-S9-i5" overflow="scroll"><mml:mrow><mml:mtext>Accuracy</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mfenced close=")" open="("><mml:mrow><mml:mtext>Acc</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac><mml:mrow><mml:mtext>TP + TN</mml:mtext>
</mml:mrow>
<mml:mrow><mml:mtext>TP + TN + FP + FN</mml:mtext>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>(d) Matthews correlation coefficient (MCC) is considered to be the most robust parameter of any class prediction method [<xref ref-type="bibr" rid="B19">19</xref>
]. MCC equal to 1 is regarded as perfect prediction while 0 suggests completely random prediction.</p>
<p><disp-formula><mml:math id="M6" name="1471-2105-13-S15-S9-i6" overflow="scroll"><mml:mrow><mml:mtext>MCC</mml:mtext>
<mml:mo> = </mml:mo>
<mml:mfrac><mml:mrow><mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mtext>TP</mml:mtext>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mtext>TN</mml:mtext>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mtext>FP</mml:mtext>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mtext>FN</mml:mtext>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow><mml:msqrt><mml:mrow><mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mtext>TP + FP</mml:mtext>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mtext>TP + FN</mml:mtext>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mtext>TN + FP</mml:mtext>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow><mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow><mml:mtext>TN + FN</mml:mtext>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
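<p>where TP is the number of correctly predicted proteobacterial proteins (true positives) and TN is the number of correctly predicted plant proteins (true negatives). FP is the number of plant proteins falsely predicted as proteobacterial, and FN is the number of proteobacterial proteins falsely predicted as plant.</p>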
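<p>The four parameters can be computed from the confusion-matrix counts as in the following Python sketch. Accuracy uses the standard denominator TP + TN + FP + FN. The counts in the example are hypothetical and serve only to illustrate the arithmetic, and returning an MCC of 0 when its denominator is zero is a convention assumed here, not one stated in the text.</p>
<preformat preformat-type="code">
```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity and accuracy (in percent) and the MCC."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report MCC as 0 when the denominator is 0.
    mcc = ((tp * tn) - (fp * fn)) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Hypothetical counts, not results from this study.
sn, sp, acc, mcc = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(round(sn, 2), round(sp, 2), round(acc, 2), round(mcc, 3))
# 90.0 80.0 85.0 0.704
```
</preformat>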
</sec>
</sec>
<sec><title>Results and discussion</title>
<p>To test whether the AAC of proteobacterial and plant proteins differs significantly, we calculated the AAC for both the proteobacterial (Table <xref ref-type="table" rid="T1">1</xref>
) and plant (Table <xref ref-type="table" rid="T2">2</xref>
) protein datasets (Figure <xref ref-type="fig" rid="F4">4</xref>
). The AAC of proteobacteria and plants differed with respect to alanine, cysteine, glycine, lysine, arginine and serine. We also calculated the DC for these two datasets (figure not shown). Three feature sets were used as input to the SVM: AAC, DC and a hybrid of AAC and DC [<xref ref-type="bibr" rid="B20">20</xref>
]. We trained models with all three kernels (linear (Table <xref ref-type="table" rid="T3">3</xref>
), polynomial (Table <xref ref-type="table" rid="T4">4</xref>
) and RBF (Table <xref ref-type="table" rid="T5">5</xref>
)) to identify the best-performing kernel. Comparison of the accuracies and MCCs obtained with the three kernels showed that the RBF kernel performed best at all three redundancy percentages (Table <xref ref-type="table" rid="T5">5</xref>
). At 90% redundancy the SVM achieved a maximum accuracy of 92.44% and a 0.85 MCC for the AAC model [RBF parameters: g = 0.04, c = 4, j = 1], a maximum accuracy of 94.67% and 0.89 MCC for the DC model [g = 0.02, c = 6, j = 2] and for the hybrid model a maximum accuracy of 94.86% and a 0.90 MCC [g = 0.01, c = 8, j = 1]. At 50% redundancy, maximum accuracies for the AAC, DC and hybrid models were 91.62% (MCC 0.83) [g = 0.04, c = 2, j = 1], 94.12% (MCC 0.88) [g = 0.02, c = 4, j = 1] and 94.49% (MCC 0.89) [g = 0.01, c = 8, j = 2] respectively. At 30% redundancy, the maximum accuracies of the AAC, DC and hybrid models were 92.30% (MCC 0.84) [g = 0.05, c = 1, j = 1], 93.72% (MCC 0.87) [g = 0.03, c = 2, j = 2] and 93.84% (MCC 0.88) [g = 0.01, c = 4, j = 2]. As shown in Table <xref ref-type="table" rid="T5">5</xref>
 we achieved maximum accuracy with the hybrid model at 90% redundancy.</p>
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p><bold>Comparative amino acid compositions of positive dataset (proteobacteria) and negative dataset (plants)</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S15-S9-4"></graphic>
</fig>
<table-wrap id="T3" position="float"><label>Table 3</label>
<caption><p>Results of SVM models based on AAC, DC and hybrid (AAC+DC) features at three different redundancy percentages using the linear kernel (t = 0).</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Redundancy (%)</th>
<th align="center" colspan="2">Amino acid composition (AAC)</th>
<th align="center" colspan="2">Dipeptide composition (DC)</th>
<th align="center" colspan="2">Hybrid (AAC+DC)</th>
</tr>
</thead>
<tbody><tr><td></td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">30</td>
<td align="center">87.87</td>
<td align="center">0.75</td>
<td align="center">90.74</td>
<td align="center">0.81</td>
<td align="center">89.10</td>
<td align="center">0.78</td>
</tr>
<tr><td align="center">50</td>
<td align="center">87.45</td>
<td align="center">0.74</td>
<td align="center">90.87</td>
<td align="center">0.81</td>
<td align="center">91.81</td>
<td align="center">0.83</td>
</tr>
<tr><td align="center">90</td>
<td align="center">87.45</td>
<td align="center">0.74</td>
<td align="center">90.87</td>
<td align="center">0.81</td>
<td align="center">89.33</td>
<td align="center">0.78</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4" position="float"><label>Table 4</label>
<caption><p>Results of SVM models based on AAC, DC and hybrid (AAC+DC) features at three different redundancy percentages using the polynomial kernel (t = 1). The value of the additional kernel parameter d used for each model is given in parentheses.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Redundancy (%)</th>
<th align="center" colspan="2">Amino acid composition (AAC)</th>
<th align="center" colspan="2">Dipeptide composition (DC)</th>
<th align="center" colspan="2">Hybrid (AAC+DC)</th>
</tr>
</thead>
<tbody><tr><td></td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">30</td>
<td align="center">(d5) 89.63</td>
<td align="center">0.78</td>
<td align="center">(d2) 92.06</td>
<td align="center">0.84</td>
<td align="center">(d6) 91.29</td>
<td align="center">0.82</td>
</tr>
<tr><td align="center">50</td>
<td align="center">(d4) 88.88</td>
<td align="center">0.77</td>
<td align="center">(d2) 91.98</td>
<td align="center">0.83</td>
<td align="center">(d4) 90.68</td>
<td align="center">0.81</td>
</tr>
<tr><td align="center">90</td>
<td align="center">(d5) 89.46</td>
<td align="center">0.78</td>
<td align="center">(d2) 92.54</td>
<td align="center">0.85</td>
<td align="center">(d4) 91.26</td>
<td align="center">0.82</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5" position="float"><label>Table 5</label>
<caption><p>Results of SVM models based on AAC, DC and hybrid (AAC+DC) features at three different redundancy percentages using the RBF kernel (t = 2).</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Redundancy (%)</th>
<th align="center" colspan="2">Amino acid composition (AAC)</th>
<th align="center" colspan="2">Dipeptide composition (DC)</th>
<th align="center" colspan="2">Hybrid (AAC+DC)</th>
</tr>
</thead>
<tbody><tr><td></td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
<td align="center"><bold>Accuracy (%)</bold>
</td>
<td align="center"><bold>MCC</bold>
</td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">30</td>
<td align="center">92.30</td>
<td align="center">0.84</td>
<td align="center">93.72</td>
<td align="center">0.87</td>
<td align="center">93.84</td>
<td align="center">0.88</td>
</tr>
<tr><td align="center">50</td>
<td align="center">91.62</td>
<td align="center">0.83</td>
<td align="center">94.12</td>
<td align="center">0.88</td>
<td align="center">94.49</td>
<td align="center">0.89</td>
</tr>
<tr><td align="center">90</td>
<td align="center">92.44</td>
<td align="center">0.85</td>
<td align="center">94.67</td>
<td align="center">0.89</td>
<td align="center">94.86</td>
<td align="center">0.90</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results for the validation datasets from six species (for all three models) are shown in Table <xref ref-type="table" rid="T6">6</xref>
. The hybrid model trained at 90% redundancy had the best accuracy, with the exception of <italic>Rhizobium fredii</italic>
, for which the 50% redundancy model was better. As can be seen from Table <xref ref-type="table" rid="T5">5</xref>
, the hybrid model at 90% redundancy performed best overall for most species. The decrease in performance observed when more proteins were removed on the basis of similarity may be due not to the identity threshold itself but to a resulting imbalance in the training datasets, since the redundancy criteria reduced the number of proteobacterial proteins more strongly than the number of plant proteins. Because these estimates depend on the threshold used to distinguish positives from negatives, we constructed a ROC curve to examine the model's accuracy. ROC curves have been used to show the accuracy of constructed models [<xref ref-type="bibr" rid="B21">21</xref>
-<xref ref-type="bibr" rid="B29">29</xref>
]. The ROC curve is a graphical representation of sensitivity (true positive rate) vs. one minus specificity (false positive rate) for any binary classifier system [<xref ref-type="bibr" rid="B30">30</xref>
]. It is a threshold independent evaluation parameter and gives a value known as Area Under Curve (AUC) (Figure <xref ref-type="fig" rid="F5">5</xref>
) which shows the performance of a classifier in a two class problem [<xref ref-type="bibr" rid="B31">31</xref>
]. The higher the AUC, the more accurate the model. In the present study the AUC for the hybrid model was 0.985, indicating that the model is highly accurate.</p>
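<p>The relationship between the ROC curve and the AUC can be sketched in Python as follows. This is a generic trapezoidal computation over all score thresholds, offered as an assumption-laden illustration and not as the procedure used to produce Figure 5; the function name, the labels and the scores are hypothetical.</p>
<preformat preformat-type="code">
```python
def roc_auc(labels, scores):
    """Area under the ROC curve via trapezoids over all score thresholds.

    labels: 1 for positive (proteobacterial), 0 for negative (plant).
    scores: classifier decision values; higher means more positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    ranked = sorted(zip(scores, labels), reverse=True)
    auc = 0.0
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a ROC point only after the last example of a tied score group.
        last_in_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_in_group:
            tpr, fpr = tp / n_pos, fp / n_neg
            auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
            prev_tpr, prev_fpr = tpr, fpr
    return auc


# Hypothetical decision values for three positive and three negative examples.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(example_labels, example_scores), 4))  # 0.8889
```
</preformat>
<p>In this example eight of the nine positive-negative pairs are ranked correctly, which gives an AUC of 8/9; a classifier that ranks every positive above every negative would reach 1.0.</p>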
<table-wrap id="T6" position="float"><label>Table 6</label>
<caption><p>Validation accuracy (%) of SVM models trained on AAC, DC and hybrid features at three redundancy percentages.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">AAC</th>
<th align="center"><italic>Serratia marcescens</italic> (127)</th>
<th align="center"><italic>Acidovorax citrulli</italic> (314)</th>
<th align="center"><italic>Rhizobium fredii</italic> (16)</th>
<th align="center"><italic>Solanum lycopersicum</italic> (413)</th>
<th align="center"><italic>Phaseolus vulgaris</italic> (159)</th>
<th align="center"><italic>Cucurbita pepo</italic> (15)</th>
</tr>
</thead>
<tbody><tr><td align="center">30</td>
<td align="center">78.74</td>
<td align="center">98.09</td>
<td align="center">75</td>
<td align="center">93.46</td>
<td align="center">94.97</td>
<td align="center">100</td>
</tr>
<tr><td align="center">50</td>
<td align="center">70.87</td>
<td align="center">96.18</td>
<td align="center">75</td>
<td align="center">94.19</td>
<td align="center">96.23</td>
<td align="center">100</td>
</tr>
<tr><td align="center">90</td>
<td align="center">70.87</td>
<td align="center">97.77</td>
<td align="center">75</td>
<td align="center">93.46</td>
<td align="center">96.86</td>
<td align="center">100</td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">DC</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">30</td>
<td align="center">73.23</td>
<td align="center">96.82</td>
<td align="center">75</td>
<td align="center">94.19</td>
<td align="center">97.48</td>
<td align="center">100</td>
</tr>
<tr><td align="center">50</td>
<td align="center">75.59</td>
<td align="center">97.45</td>
<td align="center">68.75</td>
<td align="center">94.92</td>
<td align="center">96.84</td>
<td align="center">100</td>
</tr>
<tr><td align="center">90</td>
<td align="center">74.8</td>
<td align="center">97.77</td>
<td align="center">62.5</td>
<td align="center">95.16</td>
<td align="center">98.72</td>
<td align="center">100</td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">Hybrid (AAC+DC)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr><td colspan="7"><hr></hr>
</td>
</tr>
<tr><td align="center">30</td>
<td align="center">74.8</td>
<td align="center">96.5</td>
<td align="center">75</td>
<td align="center">95.4</td>
<td align="center">98.74</td>
<td align="center">100</td>
</tr>
<tr><td align="center">50</td>
<td align="center">79.53</td>
<td align="center">99.04</td>
<td align="center">81.25</td>
<td align="center">93.46</td>
<td align="center">97.48</td>
<td align="center">100</td>
</tr>
<tr><td align="center">90</td>
<td align="center">81.1</td>
<td align="center">99.04</td>
<td align="center">79.53</td>
<td align="center">93.95</td>
<td align="center">98.11</td>
<td align="center">100</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>The accuracy is calculated by dividing the number of correct predictions by total number of protein inputs. The numbers of reviewed proteins are shown in parentheses.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F5" position="float"><label>Figure 5</label>
<caption><p><bold>The ROC (Receiver Operating Characteristic) curve and the area under the curve for the best hybrid model at 90% redundancy</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S15-S9-5"></graphic>
</fig>
<p>This SVM model can be used to determine whether a query sequence originated from a plant or a proteobacterium, thus enabling timely detection of infection. It may also be used to identify bacterial contamination of food by sequencing samples. SVM models could likewise be applied to animal proteins: just as we have developed a model for plant and proteobacterial proteins, another model could be designed to distinguish animal proteins from pathogenic proteobacterial proteins. Thus, SVMs can be used in a variety of fields of study.</p>
</sec>
<sec sec-type="conclusions"><title>Conclusion</title>
<p>The SVM models based on the hybrid approach, using both amino acid and dipeptide features, exhibited the maximum accuracy on both threshold-dependent and threshold-independent parameters. The best results were obtained with an RBF kernel and with protein sets that did not contain any protein more than 90% identical to another protein in the dataset. SVMs have great potential to handle large datasets and thus can be used for sorting proteobacterial sequences from a mixed background, such as that found in metagenomic sequence data. As such, an SVM classifier would be a step forward in surveillance techniques for bacteria that lack previously characterized relatives. It may be useful for classifying protein sequences obtained from non-sequenced genomes not yet present in GenBank. Other features, such as domains specific to nitrogen-oxidising or nitrogen-fixing bacteria, could also be used, even to distinguish a pathogenic proteobacterium from a non-pathogenic one. This may be used to determine the kinds of bacterial pathogens present in food samples, thus improving food security. Human pathogens that are proteobacterial in nature also exist, and specific SVM models can be trained to distinguish them. Thus, SVMs hold great potential for solving a variety of problems in biology.</p>
</sec>
<sec><title>List of abbreviations used</title>
<p>SVM: Support Vector Machine; AAC: Amino Acid Composition; DC: Dipeptide Composition; RBF: Radial Basis Function; MCC: Matthews Correlation Coefficient; FAO: Food and Agriculture Organization; ROC: Receiver Operating Characteristic; AUC: Area Under the Curve; TP: True Positive; TN: True Negative; FN: False Negative; FP: False Positive.</p>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><title>Authors' contributions</title>
<p>RV designed the study, performed the machine learning analysis and drafted the manuscript. UM conceived of the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>The authors acknowledge support from USDA-NIFA grant number 2010-85605-20542 and from the Oklahoma Agricultural Experiment Station, whose Director has approved the manuscript for publication. The authors thank William Schneider and Jacqueline Fletcher for helpful discussion. We also thank James Borrone and Rakesh Kaundal for reading a draft of the manuscript.</p>
<p>This article has been published as part of <italic>BMC Bioinformatics </italic>
Volume 13 Supplement 15, 2012: Proceedings of the Ninth Annual MCBIOS Conference. Dealing with the Omics Data Deluge. The full contents of the supplement are available online at <ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S15">http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S15</ext-link>
</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="journal"><name><surname>Strange</surname>
<given-names>RN</given-names>
</name>
<name><surname>Scott</surname>
<given-names>PR</given-names>
</name>
<article-title>Plant disease: a threat to global food security</article-title>
<source>Annual review of phytopathology</source>
<year>2005</year>
<volume>43</volume>
<fpage>83</fpage>
<lpage>116</lpage>
<pub-id pub-id-type="doi">10.1146/annurev.phyto.43.113004.133839</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Emerson</surname>
<given-names>D</given-names>
</name>
<name><surname>Rentz</surname>
<given-names>JA</given-names>
</name>
<name><surname>Lilburn</surname>
<given-names>TG</given-names>
</name>
<name><surname>Davis</surname>
<given-names>RE</given-names>
</name>
<name><surname>Aldrich</surname>
<given-names>H</given-names>
</name>
<name><surname>Chan</surname>
<given-names>C</given-names>
</name>
<name><surname>Moyer</surname>
<given-names>CL</given-names>
</name>
<article-title>A novel lineage of proteobacteria involved in formation of marine Fe-oxidizing microbial mat communities</article-title>
<source>PloS one</source>
<year>2007</year>
<volume>2</volume>
<issue>7</issue>
<fpage>e667</fpage>
<pub-id pub-id-type="pmid">17668050</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="book"><name><surname>Melcher</surname>
<given-names>U</given-names>
</name>
<name><surname>Grover</surname>
<given-names>V</given-names>
</name>
<person-group person-group-type="editor">Caranta C, Aranda MA, Tepfer M, López-Moya JJ</person-group>
<article-title>Genomic approaches to discovery of viral species diversity of non-cultivated plants</article-title>
<source>Recent Advances in Plant Virology</source>
<year>2011</year>
<publisher-name>Norfolk UK: Caister Academic Press</publisher-name>
<fpage>321</fpage>
<lpage>342</lpage>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>Fletcher</surname>
<given-names>J</given-names>
</name>
<name><surname>Bender</surname>
<given-names>C</given-names>
</name>
<name><surname>Budowle</surname>
<given-names>B</given-names>
</name>
<name><surname>Cobb</surname>
<given-names>WT</given-names>
</name>
<name><surname>Gold</surname>
<given-names>SE</given-names>
</name>
<name><surname>Ishimaru</surname>
<given-names>CA</given-names>
</name>
<name><surname>Luster</surname>
<given-names>D</given-names>
</name>
<name><surname>Melcher</surname>
<given-names>U</given-names>
</name>
<name><surname>Murch</surname>
<given-names>R</given-names>
</name>
<name><surname>Scherm</surname>
<given-names>H</given-names>
</name>
<etal></etal>
<article-title>Plant Pathogen Forensics: Capabilities, Needs and Recommendations</article-title>
<source>MMBR</source>
<year>2006</year>
<volume>70</volume>
<issue>2</issue>
<fpage>450</fpage>
<lpage>471</lpage>
<pub-id pub-id-type="pmid">16760310</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name><surname>Gish</surname>
<given-names>W</given-names>
</name>
<name><surname>Miller</surname>
<given-names>W</given-names>
</name>
<name><surname>Myers</surname>
<given-names>EW</given-names>
</name>
<name><surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<article-title>Basic local alignment search tool</article-title>
<source>Journal of molecular biology</source>
<year>1990</year>
<volume>215</volume>
<issue>3</issue>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="pmid">2231712</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="journal"><name><surname>Verma</surname>
<given-names>R</given-names>
</name>
<name><surname>Tiwari</surname>
<given-names>A</given-names>
</name>
<name><surname>Kaur</surname>
<given-names>S</given-names>
</name>
<name><surname>Varshney</surname>
<given-names>GC</given-names>
</name>
<name><surname>Raghava</surname>
<given-names>GP</given-names>
</name>
<article-title>Identification of proteins secreted by malaria parasite into erythrocyte using SVM and PSSM profiles</article-title>
<source>BMC bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>201</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-201</pub-id>
<pub-id pub-id-type="pmid">18416838</pub-id>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><name><surname>Kaundal</surname>
<given-names>R</given-names>
</name>
<name><surname>Raghava</surname>
<given-names>GP</given-names>
</name>
<article-title>RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information</article-title>
<source>Proteomics</source>
<year>2009</year>
<volume>9</volume>
<issue>9</issue>
<fpage>2324</fpage>
<lpage>2342</lpage>
<pub-id pub-id-type="doi">10.1002/pmic.200700597</pub-id>
<pub-id pub-id-type="pmid">19402042</pub-id>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="journal"><name><surname>Hu</surname>
<given-names>X</given-names>
</name>
<name><surname>Wong</surname>
<given-names>KK</given-names>
</name>
<name><surname>Young</surname>
<given-names>GS</given-names>
</name>
<name><surname>Guo</surname>
<given-names>L</given-names>
</name>
<name><surname>Wong</surname>
<given-names>ST</given-names>
</name>
<article-title>Support vector machine multiparametric MRI identification of pseudoprogression from tumor recurrence in patients with resected glioblastoma</article-title>
<source>Journal of magnetic resonance imaging: JMRI</source>
<year>2011</year>
<volume>33</volume>
<issue>2</issue>
<fpage>296</fpage>
<lpage>305</lpage>
<pub-id pub-id-type="doi">10.1002/jmri.22432</pub-id>
<pub-id pub-id-type="pmid">21274970</pub-id>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><name><surname>Choi</surname>
<given-names>S</given-names>
</name>
<name><surname>Jiang</surname>
<given-names>Z</given-names>
</name>
<article-title>Cardiac sound murmurs classification with autoregressive spectral analysis and multi-support vector machine technique</article-title>
<source>Computers in biology and medicine</source>
<year>2010</year>
<volume>40</volume>
<issue>1</issue>
<fpage>8</fpage>
<lpage>20</lpage>
<pub-id pub-id-type="doi">10.1016/j.compbiomed.2009.10.003</pub-id>
<pub-id pub-id-type="pmid">19926081</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="journal"><name><surname>Magnin</surname>
<given-names>B</given-names>
</name>
<name><surname>Mesrob</surname>
<given-names>L</given-names>
</name>
<name><surname>Kinkingnehun</surname>
<given-names>S</given-names>
</name>
<name><surname>Pelegrini-Issac</surname>
<given-names>M</given-names>
</name>
<name><surname>Colliot</surname>
<given-names>O</given-names>
</name>
<name><surname>Sarazin</surname>
<given-names>M</given-names>
</name>
<name><surname>Dubois</surname>
<given-names>B</given-names>
</name>
<name><surname>Lehericy</surname>
<given-names>S</given-names>
</name>
<name><surname>Benali</surname>
<given-names>H</given-names>
</name>
<article-title>Support vector machine-based classification of Alzheimer's disease from whole-brain anatomical MRI</article-title>
<source>Neuroradiology</source>
<year>2009</year>
<volume>51</volume>
<issue>2</issue>
<fpage>73</fpage>
<lpage>83</lpage>
<pub-id pub-id-type="doi">10.1007/s00234-008-0463-x</pub-id>
<pub-id pub-id-type="pmid">18846369</pub-id>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="other"><name><surname>Vert</surname>
<given-names>JP</given-names>
</name>
<article-title>Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings</article-title>
<source>Pacific Symposium on Biocomputing</source>
<year>2002</year>
<fpage>649</fpage>
<lpage>660</lpage>
<pub-id pub-id-type="pmid">11928516</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="journal"><name><surname>Furey</surname>
<given-names>TS</given-names>
</name>
<name><surname>Cristianini</surname>
<given-names>N</given-names>
</name>
<name><surname>Duffy</surname>
<given-names>N</given-names>
</name>
<name><surname>Bednarski</surname>
<given-names>DW</given-names>
</name>
<name><surname>Schummer</surname>
<given-names>M</given-names>
</name>
<name><surname>Haussler</surname>
<given-names>D</given-names>
</name>
<article-title>Support vector machine classification and validation of cancer tissue samples using microarray expression data</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<issue>10</issue>
<fpage>906</fpage>
<lpage>914</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.10.906</pub-id>
<pub-id pub-id-type="pmid">11120680</pub-id>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="journal"><name><surname>Dharmasaroja</surname>
<given-names>P</given-names>
</name>
<name><surname>Dharmasaroja</surname>
<given-names>PA</given-names>
</name>
<article-title>Prediction of intracerebral hemorrhage following thrombolytic therapy for acute ischemic stroke using multiple artificial neural networks</article-title>
<source>Neurological research</source>
<year>2012</year>
<volume>34</volume>
<issue>2</issue>
<fpage>120</fpage>
<lpage>128</lpage>
<pub-id pub-id-type="pmid">22333462</pub-id>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><name><surname>Naguib</surname>
<given-names>IA</given-names>
</name>
<name><surname>Darwish</surname>
<given-names>HW</given-names>
</name>
<article-title>Support vector regression and artificial neural network models for stability indicating analysis of mebeverine hydrochloride and sulpiride mixtures in pharmaceutical preparation: a comparative study</article-title>
<source>Spectrochimica acta Part A, Molecular and biomolecular spectroscopy</source>
<year>2012</year>
<volume>86</volume>
<fpage>515</fpage>
<lpage>526</lpage>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="other"><name><surname>Dondoshansky</surname>
<given-names>IWY</given-names>
</name>
<source>BLASTCLUST - BLAST score-based single-linkage clustering</source>
<year>2000</year>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="book"><name><surname>Joachims</surname>
<given-names>T</given-names>
</name>
<source>Learning to classify text using support vector machines</source>
<year>2002</year>
<publisher-name>Boston: Kluwer Academic Publishers</publisher-name>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="journal"><name><surname>O'Dwyer</surname>
<given-names>L</given-names>
</name>
<name><surname>Lamberton</surname>
<given-names>F</given-names>
</name>
<name><surname>Bokde</surname>
<given-names>AL</given-names>
</name>
<name><surname>Ewers</surname>
<given-names>M</given-names>
</name>
<name><surname>Faluyi</surname>
<given-names>YO</given-names>
</name>
<name><surname>Tanner</surname>
<given-names>C</given-names>
</name>
<name><surname>Mazoyer</surname>
<given-names>B</given-names>
</name>
<name><surname>O'Neill</surname>
<given-names>D</given-names>
</name>
<name><surname>Bartley</surname>
<given-names>M</given-names>
</name>
<name><surname>Collins</surname>
<given-names>DR</given-names>
</name>
<etal></etal>
<article-title>Using support vector machines with multiple indices of diffusion for automated classification of mild cognitive impairment</article-title>
<source>PloS one</source>
<year>2012</year>
<volume>7</volume>
<issue>2</issue>
<fpage>e32441</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0032441</pub-id>
<pub-id pub-id-type="pmid">22384251</pub-id>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><name><surname>Ansari</surname>
<given-names>HR</given-names>
</name>
<name><surname>Raghava</surname>
<given-names>GP</given-names>
</name>
<article-title>Identification of conformational B-cell Epitopes in an antigen from its primary sequence</article-title>
<source>Immunome research</source>
<year>2010</year>
<volume>6</volume>
<fpage>6</fpage>
<pub-id pub-id-type="doi">10.1186/1745-7580-6-6</pub-id>
<pub-id pub-id-type="pmid">20961417</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="journal"><name><surname>Baldi</surname>
<given-names>P</given-names>
</name>
<name><surname>Brunak</surname>
<given-names>S</given-names>
</name>
<name><surname>Chauvin</surname>
<given-names>Y</given-names>
</name>
<name><surname>Andersen</surname>
<given-names>CA</given-names>
</name>
<name><surname>Nielsen</surname>
<given-names>H</given-names>
</name>
<article-title>Assessing the accuracy of prediction algorithms for classification: an overview</article-title>
<source>Bioinformatics</source>
<year>2000</year>
<volume>16</volume>
<issue>5</issue>
<fpage>412</fpage>
<lpage>424</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/16.5.412</pub-id>
<pub-id pub-id-type="pmid">10871264</pub-id>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><name><surname>Verma</surname>
<given-names>R</given-names>
</name>
<name><surname>Varshney</surname>
<given-names>GC</given-names>
</name>
<name><surname>Raghava</surname>
<given-names>GP</given-names>
</name>
<article-title>Prediction of mitochondrial proteins of malaria parasite using split amino acid composition and PSSM profile</article-title>
<source>Amino acids</source>
<year>2010</year>
<volume>39</volume>
<issue>1</issue>
<fpage>101</fpage>
<lpage>110</lpage>
<pub-id pub-id-type="doi">10.1007/s00726-009-0381-1</pub-id>
<pub-id pub-id-type="pmid">19908123</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="journal"><name><surname>Lu</surname>
<given-names>Q</given-names>
</name>
<name><surname>Cui</surname>
<given-names>Y</given-names>
</name>
<name><surname>Ye</surname>
<given-names>C</given-names>
</name>
<name><surname>Wei</surname>
<given-names>C</given-names>
</name>
<name><surname>Elston</surname>
<given-names>RC</given-names>
</name>
<article-title>Bagging optimal ROC curve method for predictive genetic tests, with an application for rheumatoid arthritis</article-title>
<source>Journal of biopharmaceutical statistics</source>
<year>2010</year>
<volume>20</volume>
<issue>2</issue>
<fpage>401</fpage>
<lpage>414</lpage>
<pub-id pub-id-type="doi">10.1080/10543400903572811</pub-id>
<pub-id pub-id-type="pmid">20309765</pub-id>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="journal"><name><surname>He</surname>
<given-names>X</given-names>
</name>
<name><surname>Frey</surname>
<given-names>E</given-names>
</name>
<article-title>ROC, LROC, FROC, AFROC: an alphabet soup</article-title>
<source>Journal of the American College of Radiology: JACR</source>
<year>2009</year>
<volume>6</volume>
<issue>9</issue>
<fpage>652</fpage>
<lpage>655</lpage>
<pub-id pub-id-type="doi">10.1016/j.jacr.2009.06.001</pub-id>
<pub-id pub-id-type="pmid">19720362</pub-id>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="journal"><name><surname>Chappell</surname>
<given-names>FM</given-names>
</name>
<name><surname>Raab</surname>
<given-names>GM</given-names>
</name>
<name><surname>Wardlaw</surname>
<given-names>JM</given-names>
</name>
<article-title>When are summary ROC curves appropriate for diagnostic meta-analyses?</article-title>
<source>Statistics in medicine</source>
<year>2009</year>
<volume>28</volume>
<issue>21</issue>
<fpage>2653</fpage>
<lpage>2668</lpage>
<pub-id pub-id-type="doi">10.1002/sim.3631</pub-id>
<pub-id pub-id-type="pmid">19591118</pub-id>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="journal"><name><surname>Algarabel</surname>
<given-names>S</given-names>
</name>
<name><surname>Pitarque</surname>
<given-names>A</given-names>
</name>
<article-title>ROC parameters in item and context recognition</article-title>
<source>Psicothema</source>
<year>2007</year>
<volume>19</volume>
<issue>1</issue>
<fpage>163</fpage>
<lpage>170</lpage>
<pub-id pub-id-type="pmid">17295999</pub-id>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><name><surname>Higashida</surname>
<given-names>Y</given-names>
</name>
<name><surname>Ideguchi</surname>
<given-names>T</given-names>
</name>
<name><surname>Muranaka</surname>
<given-names>T</given-names>
</name>
<name><surname>Tabata</surname>
<given-names>N</given-names>
</name>
<name><surname>Miyajima</surname>
<given-names>R</given-names>
</name>
<name><surname>Akazawa</surname>
<given-names>F</given-names>
</name>
<name><surname>Ikeda</surname>
<given-names>H</given-names>
</name>
<name><surname>Morimoto</surname>
<given-names>K</given-names>
</name>
<name><surname>Ohki</surname>
<given-names>M</given-names>
</name>
<name><surname>Toyofuku</surname>
<given-names>F</given-names>
</name>
<etal></etal>
<article-title>[ROC analysis of detection of interval changes in interstitial lung diseases on digital chest radiographs using the temporal subtraction technique]</article-title>
<source>Nihon Igaku Hoshasen Gakkai zasshi Nippon acta radiologica</source>
<year>2004</year>
<volume>64</volume>
<issue>1</issue>
<fpage>35</fpage>
<lpage>40</lpage>
<pub-id pub-id-type="pmid">14994509</pub-id>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><name><surname>Wiebringhaus</surname>
<given-names>R</given-names>
</name>
<name><surname>John</surname>
<given-names>V</given-names>
</name>
<name><surname>Muller</surname>
<given-names>RD</given-names>
</name>
<name><surname>Hirche</surname>
<given-names>H</given-names>
</name>
<name><surname>Voss</surname>
<given-names>M</given-names>
</name>
<name><surname>Callies</surname>
<given-names>R</given-names>
</name>
<article-title>[ROC analysis of image quality in digital luminescence radiography in comparison with current film-screen systems in mammography]</article-title>
<source>Aktuelle Radiologie</source>
<year>1995</year>
<volume>5</volume>
<issue>4</issue>
<fpage>263</fpage>
<lpage>267</lpage>
<pub-id pub-id-type="pmid">7548257</pub-id>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="journal"><name><surname>Daures</surname>
<given-names>JP</given-names>
</name>
<article-title>[Use of ROC curves in medical imaging]</article-title>
<source>Journal de radiologie</source>
<year>1991</year>
<volume>72</volume>
<issue>8-9</issue>
<fpage>445</fpage>
<lpage>461</lpage>
<pub-id pub-id-type="pmid">1920263</pub-id>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="journal"><name><surname>Hannequin</surname>
<given-names>P</given-names>
</name>
<name><surname>Liehn</surname>
<given-names>JC</given-names>
</name>
<name><surname>Delisle</surname>
<given-names>MJ</given-names>
</name>
<name><surname>Deltour</surname>
<given-names>G</given-names>
</name>
<name><surname>Valeyre</surname>
<given-names>J</given-names>
</name>
<article-title>ROC analysis in radioimmunoassay: an application to the interpretation of thyroglobulin measurement in the follow-up of thyroid carcinoma</article-title>
<source>European journal of nuclear medicine</source>
<year>1987</year>
<volume>13</volume>
<issue>4</issue>
<fpage>203</fpage>
<lpage>206</lpage>
<pub-id pub-id-type="pmid">3622567</pub-id>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><name><surname>Creelman</surname>
<given-names>CD</given-names>
</name>
<name><surname>Donaldson</surname>
<given-names>W</given-names>
</name>
<article-title>ROC curves for discrimination of linear extent</article-title>
<source>Journal of experimental psychology</source>
<year>1968</year>
<volume>77</volume>
<issue>3</issue>
<fpage>514</fpage>
<lpage>516</lpage>
<pub-id pub-id-type="pmid">5665590</pub-id>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="book"><name><surname>Balakrishnan</surname>
<given-names>N</given-names>
</name>
<source>Handbook of the logistic distribution</source>
<year>1992</year>
<publisher-name>New York: Dekker</publisher-name>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="journal"><name><surname>Zahr</surname>
<given-names>N</given-names>
</name>
<name><surname>Arnaud</surname>
<given-names>L</given-names>
</name>
<name><surname>Marquet</surname>
<given-names>P</given-names>
</name>
<name><surname>Haroche</surname>
<given-names>J</given-names>
</name>
<name><surname>Costedoat-Chalumeau</surname>
<given-names>N</given-names>
</name>
<name><surname>Hulot</surname>
<given-names>JS</given-names>
</name>
<name><surname>Funck-Brentano</surname>
<given-names>C</given-names>
</name>
<name><surname>Piette</surname>
<given-names>JC</given-names>
</name>
<name><surname>Amoura</surname>
<given-names>Z</given-names>
</name>
<article-title>Mycophenolic acid area under the curve correlates with disease activity in lupus patients treated with mycophenolate mofetil</article-title>
<source>Arthritis and rheumatism</source>
<year>2010</year>
<volume>62</volume>
<issue>7</issue>
<fpage>2047</fpage>
<lpage>2054</lpage>
<pub-id pub-id-type="pmid">20506558</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Bois/explor/OrangerV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A87  | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000A87  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Bois
   |area=    OrangerV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.25.
Data generation: Sat Dec 3 17:11:04 2016. Site generation: Wed Mar 6 18:18:32 2024

	Exploration server on the orange tree
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
