Orange Tree Exploration Server

Internal identifier: 000A890 (Pmc/Corpus)

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning</title>
<author>
<name sortKey="Kaundal, Rakesh" sort="Kaundal, Rakesh" uniqKey="Kaundal R" first="Rakesh" last="Kaundal">Rakesh Kaundal</name>
<affiliation>
<nlm:aff id="I1">National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sahu, Sitanshu S" sort="Sahu, Sitanshu S" uniqKey="Sahu S" first="Sitanshu S" last="Sahu">Sitanshu S. Sahu</name>
<affiliation>
<nlm:aff id="I1">National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Verma, Ruchi" sort="Verma, Ruchi" uniqKey="Verma R" first="Ruchi" last="Verma">Ruchi Verma</name>
<affiliation>
<nlm:aff id="I2">Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Weirick, Tyler" sort="Weirick, Tyler" uniqKey="Weirick T" first="Tyler" last="Weirick">Tyler Weirick</name>
<affiliation>
<nlm:aff id="I1">National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">24266945</idno>
<idno type="pmc">3851450</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851450</idno>
<idno type="RBID">PMC:3851450</idno>
<idno type="doi">10.1186/1471-2105-14-S14-S7</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000A89</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning</title>
<author>
<name sortKey="Kaundal, Rakesh" sort="Kaundal, Rakesh" uniqKey="Kaundal R" first="Rakesh" last="Kaundal">Rakesh Kaundal</name>
<affiliation>
<nlm:aff id="I1">National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sahu, Sitanshu S" sort="Sahu, Sitanshu S" uniqKey="Sahu S" first="Sitanshu S" last="Sahu">Sitanshu S. Sahu</name>
<affiliation>
<nlm:aff id="I1">National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Verma, Ruchi" sort="Verma, Ruchi" uniqKey="Verma R" first="Ruchi" last="Verma">Ruchi Verma</name>
<affiliation>
<nlm:aff id="I2">Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Weirick, Tyler" sort="Weirick, Tyler" uniqKey="Weirick T" first="Tyler" last="Weirick">Tyler Weirick</name>
<affiliation>
<nlm:aff id="I1">National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Plastids are an important component of plant cells: they are the site of manufacture and storage of chemical compounds used by the cell, and they contain pigments such as those used in photosynthesis, starch synthesis/storage, cell color,
<italic>etc</italic>
. They are essential organelles of the plant cell and are also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day. The predicted proteomes of these genomes need annotation at a faster pace. One such annotation need is an automated system that can accurately distinguish between plastid and non-plastid proteins and further classify plastid types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity search and machine learning.</p>
</sec>
<sec>
<title>Results</title>
<p>In this study, we developed separate Support Vector Machine (SVM) trained classifiers for characterizing the plastids in two steps: first distinguishing the plastid vs. non-plastid proteins, and then classifying the identified plastids into their various types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features: amino acid composition, dipeptide composition, the pseudo amino acid composition, N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition, and the protein physicochemical properties are used to develop SVM models. Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and an accuracy of 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with overall accuracies of 76.58% and 74.97% in phase-I and phase-II, respectively. The similarity-based PSI-BLAST module shows very low performance, with about 50% prediction accuracy for distinguishing plastid vs. non-plastid proteins and only 20% in classifying the various plastid types, indicating the need for and importance of machine learning algorithms.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at
<ext-link ext-link-type="uri" xlink:href="http://bioinfo.okstate.edu/PLpred/">http://bioinfo.okstate.edu/PLpred/</ext-link>
for real-time identification/characterization. We believe this tool will be very useful in the functional annotation of various genomes.</p>
</sec>
</div>
</front>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">24266945</article-id>
<article-id pub-id-type="pmc">3851450</article-id>
<article-id pub-id-type="publisher-id">1471-2105-14-S14-S7</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-14-S14-S7</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Identification and characterization of plastid-type proteins from sequence-attributed features using machine learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Kaundal</surname>
<given-names>Rakesh</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>r.kaundal@okstate.edu</email>
</contrib>
<contrib contrib-type="author" equal-contrib="yes" id="A2">
<name>
<surname>Sahu</surname>
<given-names>Sitanshu S</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes" id="A3">
<name>
<surname>Verma</surname>
<given-names>Ruchi</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>Weirick</surname>
<given-names>Tyler</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
National Institute for Microbial Forensics &amp; Food and Agricultural Biosecurity (NIMFFAB), Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</aff>
<aff id="I2">
<label>2</label>
Department of Biochemistry &amp; Molecular Biology, Oklahoma State University, Stillwater, OK 74078, USA</aff>
<pub-date pub-type="collection">
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>9</day>
<month>10</month>
<year>2013</year>
</pub-date>
<volume>14</volume>
<issue>Suppl 14</issue>
<supplement>
<named-content content-type="supplement-title">Proceedings of the Tenth Annual MCBIOS Conference</named-content>
<named-content content-type="supplement-editor">Jonathan D Wren (Senior Editor), Mikhail G Dozmorov, Dennis Burian, Rakesh Kaundal, Andy Perkins, Ed Perkins, Doris M Kupfer and Gordon K Springer</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. Articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</named-content>
</supplement>
<fpage>S7</fpage>
<lpage>S7</lpage>
<permissions>
<copyright-statement>Copyright © 2013 Kaundal et al; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2013</copyright-year>
<copyright-holder>Kaundal et al; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/14/S14/S7"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Plastids are an important component of plant cells. They are the site of manufacture and storage of chemical compounds used by the cell, contain pigments such as those used in photosynthesis, and are involved in starch synthesis/storage, cell color,
<italic>etc</italic>
. They are essential organelles of the plant cell and are also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day, and the predicted proteomes of these genomes need to be annotated at a faster pace. One such annotation need is an automated system that can accurately distinguish plastid from non-plastid proteins and further classify plastid types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid proteins and found significant differences, which we used as the basis for developing feature-based prediction models using similarity search and machine learning.</p>
</sec>
<sec>
<title>Results</title>
<p>In this study, we developed separate Support Vector Machine (SVM) trained classifiers for characterizing the plastids in two steps: first distinguishing the plastid vs. non-plastid proteins, and then classifying the identified plastids into their various types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features: amino acid composition, dipeptide composition, the pseudo amino acid composition, N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition and the protein physicochemical properties were used to develop SVM models. Overall, the dipeptide composition-based module shows the best performance, with an accuracy of 86.80% and a Matthews Correlation Coefficient (MCC) of 0.74 in phase-I, and 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better, with an overall accuracy of 76.58% and 74.97% in phase-I and phase-II, respectively. The similarity-based PSI-BLAST module shows very low performance, with about 50% prediction accuracy for distinguishing plastid vs. non-plastid proteins and only 20% in classifying the various plastid types, indicating the need for and importance of machine learning algorithms.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred, at
<ext-link ext-link-type="uri" xlink:href="http://bioinfo.okstate.edu/PLpred/">http://bioinfo.okstate.edu/PLpred/</ext-link>
for real-time identification/characterization. We believe this tool will be very useful in the functional annotation of various genomes.</p>
</sec>
</abstract>
<conference>
<conf-date>5-6 April 2013</conf-date>
<conf-name>Tenth Annual MCBIOS Conference. Discovery in a sea of data</conf-name>
<conf-loc>Columbia, MO, USA</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Plastids are among the major organelles of the plant cell; they perform essential biosynthetic and metabolic functions [
<xref ref-type="bibr" rid="B1">1</xref>
]. These functions include photosynthetic carbon fixation, synthesis of amino acids, fatty acids, starch and secondary metabolites such as pigments [
<xref ref-type="bibr" rid="B2">2</xref>
]. On the basis of their structure, pigment composition (color), metabolism and function, plastids are classified as 'chloroplasts' in photosynthetically active tissues, 'chromoplasts' in fruits and petals, 'amyloplasts' in roots, 'etioplasts' in dark-grown seedlings and 'elaioplasts' in the seed endosperm (Figure
<xref ref-type="fig" rid="F1">1</xref>
). Though plastids are of significant biological interest, current understanding of the metabolic functions and capacities of different plastid types is still limited [
<xref ref-type="bibr" rid="B3">3</xref>
]. Proteomics is a powerful approach to map the complete set of plastid proteins, and to infer plastid-type specific metabolic functions as well. Over the years, several proteomic analyses of plastids have been reported [
<xref ref-type="bibr" rid="B4">4</xref>
-
<xref ref-type="bibr" rid="B11">11</xref>
], although these come with limitations. Besides being time consuming, the experimental approaches face other constraints; for example, chloroplast proteome analysis is nearing saturation because the detection of new proteins is constrained by highly abundant photosynthetic proteins that dominate the proteome of photosynthetically active chloroplasts [
<xref ref-type="bibr" rid="B12">12</xref>
]. This has become more evident recently: identical (or nearly identical) sets of chloroplast proteins have been identified repeatedly in different studies, whereas the reported detection rate of new proteins is small [
<xref ref-type="bibr" rid="B13">13</xref>
,
<xref ref-type="bibr" rid="B14">14</xref>
].</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Plastid and its various types with their respective organelle function</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S7-1"></graphic>
</fig>
<p>Moreover, in cases such as the ordered rearrangement of the proteome during plastid differentiation, profiling of static proteomes provides only limited information on proteome dynamics [
<xref ref-type="bibr" rid="B1">1</xref>
]. To circumvent these constraints and to increase proteome coverage, the development of highly efficient computational prediction tools is another complementary approach to provide useful global information about the plastid proteomes. Various proteomic approaches have led to the development of some databases available for plant plastids, for example, the Chloroplast Genome Database [
<xref ref-type="bibr" rid="B2">2</xref>
], plprot [
<xref ref-type="bibr" rid="B13">13</xref>
], and PPDB [
<xref ref-type="bibr" rid="B15">15</xref>
]. However, there is no computational prediction system to identify and characterize various plastid types that could be used to classify '
<italic>unknown</italic>
' proteins. TargetP is currently the most widely known prediction program with a tested prediction accuracy around 68% for known plastid proteins, suggesting that a significant number of proteins cannot be identified by this type of analysis [
<xref ref-type="bibr" rid="B12">12</xref>
,
<xref ref-type="bibr" rid="B16">16</xref>
-
<xref ref-type="bibr" rid="B19">19</xref>
]. The most likely reason for this low performance is that TargetP relies on the presence of an N-terminal transit peptide region in a protein; where a protein carries alternative targeting signals, TargetP fails to predict its localization. It has been reported that plastid protein dynamics most likely also relate to the different protein-targeting routes that exist in plastids [
<xref ref-type="bibr" rid="B20">20</xref>
]. This means that novel algorithms have to be developed based on whole amino acid sequence properties. Secondly, TargetP cannot predict the plastid type of a query protein e.g. whether it is a chloroplast, chromoplast, etioplast or an amyloplast protein. Previous attempts to predict plastid-types have been unsuccessful; several etioplast proteins are not predicted by TargetP for plastid localization [
<xref ref-type="bibr" rid="B21">21</xref>
].</p>
<p>In the current study, we have developed a prediction system for the genome-wide identification and classification of plastid proteins. This method works in two phases: first, distinction between plastid and non-plastid proteins, and second, classification of the identified plastid proteins into sub-classes (chloroplast, chromoplast, etioplast, and amyloplast). Various features of a protein sequence viz. Amino acid composition (AAC), Dipeptide composition (DIPEP), Pseudo Amino Acid Composition (PseAAC), N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
(NCC) composition, and Physicochemical properties are explored in a Support Vector Machine (SVM) framework to develop diverse prediction models. In addition, the models have been tested on 'independent test' datasets for better confidence and reliability. An online tool, PLpred, has also been developed for use by the research community. With recent advances in genomics technology and more and more genomes being sequenced, there has been a surge in data generation. The predicted proteomes of these genomes thus need annotation at a much faster pace. We have developed a prediction method trained on '
<italic>known</italic>
' plastid proteins, which could be used to annotate the '
<italic>unknown</italic>
' proteins predicted from these genomic DNA sequences. We believe the current method would be a useful resource in this direction.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>Dataset preparation</title>
<p>As the current method is developed in two phases, we discuss below the data collection and preparation separately. Data was collected accordingly from various online repositories.</p>
<p>(i).
<bold>Phase-I </bold>
(plastid vs. non-plastid): The amino acid sequences belonging to plastids were downloaded from the UniProt database (
<ext-link ext-link-type="uri" xlink:href="http://www.uniprot.org">http://www.uniprot.org</ext-link>
) by searching [keywords: plastids AND reviewed: yes], which gave 17,514 sequence hits. A similar number of non-plastid sequences was collected by combining various classes such as nucleus, mitochondria, cytoplasm, Golgi body, cell membrane, peroxisome, vacuole, etc. However, the number of sequences dropped sharply, to 3535 plastid and 3191 non-plastid sequences, after we applied a sequence identity cutoff of <30% (Table
<xref ref-type="table" rid="T1">1</xref>
) on each of them using BlastClust [
<xref ref-type="bibr" rid="B22">22</xref>
]. To avoid homology bias in machine learning, a sequence identity cutoff of 25% or 30% is needed to guarantee that no protein in the benchmark dataset shares more than this level of identity with any other sequence in the dataset [
<xref ref-type="bibr" rid="B23">23</xref>
-
<xref ref-type="bibr" rid="B30">30</xref>
]. This was done within each class as well as across the classes, leaving 3160 sequences per class. Further, 10% of these data (316 sequences each for plastids and non-plastids) were set aside for later independent testing of the models. Testing on independent datasets that are not used in the machine learning process has been reported to be the best benchmark for assessing the performance of prediction models [
<xref ref-type="bibr" rid="B29">29</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
]. Finally, 2844 plastid and 2844 non-plastid sequences were used as positive and negative training sets, respectively for developing the models (Table
<xref ref-type="table" rid="T1">1</xref>
).</p>
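<p>The hold-out step described above can be sketched as follows. This is a minimal illustration in Python, assuming a random, seeded selection (the text does not state how the independent sequences were chosen); the function name hold_out_split and the placeholder identifiers are hypothetical and are not taken from the original study.</p>
<preformat>
```python
import random


def hold_out_split(items, test_fraction=0.10, seed=0):
    """Shuffle items reproducibly and split off an independent test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


# 3160 placeholder identifiers stand in for one class after the identity cutoff.
sequence_ids = [f"seq{i}" for i in range(3160)]
training_set, independent_test_set = hold_out_split(sequence_ids)
```
</preformat>
<p>With 3160 sequences per class, a 10% hold-out yields the 316 test and 2844 training sequences listed in Table 1.</p>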
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Number of protein sequences for the plastid and non-plastid classes used in phase-I (identification) training/testing</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Type</th>
<th align="center">Available</th>
<th align="center">< 30% cutoff
<break></break>
(within class)</th>
<th align="center">< 30% cutoff (across class)</th>
<th align="center">10% independent test set</th>
<th align="center">Training set</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Plastids</td>
<td align="center">17514</td>
<td align="center">3535</td>
<td align="center">3160</td>
<td align="center">316</td>
<td align="center">2844</td>
</tr>
<tr>
<td align="left">Non-plastids</td>
<td align="center">17514</td>
<td align="center">3191</td>
<td align="center">3160</td>
<td align="center">316</td>
<td align="center">2844</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<bold>Total</bold>
</td>
<td align="center">
<bold>35,028</bold>
</td>
<td align="center">
<bold>6726</bold>
</td>
<td align="center">
<bold>6320</bold>
</td>
<td align="center">
<bold>632</bold>
</td>
<td align="center">
<bold>5688</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>(ii).
<bold>Phase-II </bold>
(plastid-types): A thorough search was performed in various databases such as UniProt, NCBI (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>
), plprot (
<ext-link ext-link-type="uri" xlink:href="http://www.plprot.ethz.ch/">http://www.plprot.ethz.ch/</ext-link>
), PPDB (
<ext-link ext-link-type="uri" xlink:href="http://ppdb.tc.cornell.edu/">http://ppdb.tc.cornell.edu/</ext-link>
) to find proteins of various plastid types
<italic>viz</italic>
. chloroplast, chromoplast, etioplast, amyloplast, leucoplast, elaioplast, proteinoplast
<italic>etc</italic>
. As expected, we found far more hits for 'chloroplast' than for the other classes (Table
<xref ref-type="table" rid="T2">2</xref>
). So to increase the number of sequences in other classes, we manually searched through literature related to proteomics studies in plastid types [
<xref ref-type="bibr" rid="B5">5</xref>
-
<xref ref-type="bibr" rid="B10">10</xref>
,
<xref ref-type="bibr" rid="B13">13</xref>
,
<xref ref-type="bibr" rid="B21">21</xref>
]. These sequences were carefully curated and assigned to their classes and, finally, a training set of four plastid types (chloroplast, chromoplast, etioplast, amyloplast) was generated to develop prediction models for plastid characterization (Table
<xref ref-type="table" rid="T3">3</xref>
). These were further subjected to BlastClust analysis for <30% identity cutoff and an independent test data set was kept aside, as was done earlier in phase-I data preparation. As a result, in phase-II, 542 sequences for chloroplast, 177 for chromoplast, 220 for etioplast and 232 for amyloplast were used as a training set for classification (Table
<xref ref-type="table" rid="T3">3</xref>
).</p>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Number of sequences available for plastid types in various online databases</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="center">UniProt</th>
<th align="center">NCBI</th>
<th align="center">plprot</th>
<th align="center">PPDB</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Chloroplast</td>
<td align="center">15203</td>
<td align="center">47346</td>
<td align="center">690</td>
<td align="center">2115</td>
</tr>
<tr>
<td align="left">Chromoplast</td>
<td align="center">75</td>
<td align="center">91</td>
<td align="center">143</td>
<td align="center">11</td>
</tr>
<tr>
<td align="left">Etioplast</td>
<td align="center">56</td>
<td align="center">21</td>
<td align="center">240</td>
<td align="center">0</td>
</tr>
<tr>
<td align="left">Amyloplast</td>
<td align="center">78</td>
<td align="center">106</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="left">Leucoplast</td>
<td align="center">2</td>
<td align="center">3</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="left">Elaioplast</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td align="left">Proteinoplast</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<bold>Total</bold>
</td>
<td align="center">
<bold>15,415</bold>
</td>
<td align="center">
<bold>47,568</bold>
</td>
<td align="center">
<bold>1073</bold>
</td>
<td align="center">
<bold>2126</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Number of protein sequences for various plastid types used in phase-II (classification) training/testing</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Plastid type</th>
<th align="center">Available</th>
<th align="center">< 30% cutoff</th>
<th align="center">10% independent test set</th>
<th align="center">Training set</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Chloroplast</td>
<td align="center">690</td>
<td align="center">602</td>
<td align="center">60</td>
<td align="center">542</td>
</tr>
<tr>
<td align="left">Chromoplast</td>
<td align="center">220</td>
<td align="center">194</td>
<td align="center">17</td>
<td align="center">177</td>
</tr>
<tr>
<td align="left">Etioplast</td>
<td align="center">270</td>
<td align="center">244</td>
<td align="center">24</td>
<td align="center">220</td>
</tr>
<tr>
<td align="left">Amyloplast</td>
<td align="center">313</td>
<td align="center">255</td>
<td align="center">23</td>
<td align="center">232</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<bold>Total</bold>
</td>
<td align="center">
<bold>1493</bold>
</td>
<td align="center">
<bold>1295</bold>
</td>
<td align="center">
<bold>124</bold>
</td>
<td align="center">
<bold>1171</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Feature representation methods</title>
<p>The following diverse features were extracted from the protein sequences for use in a machine learning framework for developing prediction models in both phases:</p>
</sec>
<sec>
<title>Amino acid composition (AAC)</title>
<p>In this type of representation, each protein is defined by a 20-dimensional feature vector in Euclidean space. The protein corresponds to a point whose co-ordinates are given by the occurrence frequencies of the 20 constituent amino acids [
<xref ref-type="bibr" rid="B29">29</xref>
,
<xref ref-type="bibr" rid="B31">31</xref>
]. For a query protein
<italic>x</italic>
, let f(x
<sub>i</sub>
) denote the number of occurrences of the i-th of its 20 constituent amino acids. The composition P(x
<sub>i</sub>
) of each amino acid in the query protein is then given by:</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2105-14-S14-S7-i1" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="2em"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Hence, the protein
<italic>x </italic>
in the composition space is defined as:</p>
<p>
<disp-formula>
<mml:math id="M2" name="1471-2105-14-S14-S7-i2" overflow="scroll">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>P</mml:mtext>
</mml:mstyle>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>P</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>1</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mstyle class="text">
<mml:mtext> </mml:mtext>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>P</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>2</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mo class="MathClass-op">⋯</mml:mo>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mstyle class="text">
<mml:mtext> </mml:mtext>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>P</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>20</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
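<p>As a minimal sketch of Equation (1), written in Python, the 20-dimensional composition vector could be computed as shown below. The function name amino_acid_composition and the example sequence are hypothetical, and ignoring residues outside the 20 standard amino acids is an assumption of this sketch rather than a documented choice of the original method.</p>
<preformat>
```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return P(x_i) = f(x_i) / sum_i f(x_i) for the 20 standard amino acids."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute to the denominator (assumption).
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical 11-residue example sequence.
composition = amino_acid_composition("MKVLAAGIVGL")
```
</preformat>
<p>The resulting vector has one entry per amino acid and its entries sum to one, so each protein maps to a single point in the composition space.</p>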
</sec>
<sec>
<title>Dipeptide composition (DIPEP)</title>
<p>Dipeptide composition captures global information about the protein sequence and has been used to predict several protein attributes such as structure, function and location [
<xref ref-type="bibr" rid="B29">29</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
,
<xref ref-type="bibr" rid="B32">32</xref>
]. In this representation, the occurrence frequency of each dipeptide in the sequence is computed, producing a fixed-length pattern of 400 (20 × 20) values for the query protein. Thus, the composition of the dipeptide is given as:</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M3" name="1471-2105-14-S14-S7-i3" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="2em"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>P(x
<sub>i</sub>
,x
<sub>j</sub>
)</italic>
is the fraction of each
<italic>(x
<sub>i</sub>
,x
<sub>j</sub>
)</italic>
dipeptide and
<italic>f(x
<sub>i</sub>
,x
<sub>j</sub>
)</italic>
is the frequency of occurrence of
<italic>(x
<sub>i</sub>
,x
<sub>j</sub>
)</italic>
dipeptides, and the denominator is the sum of these frequencies over all 400 possible dipeptides, i.e. the total number of dipeptides in the sequence.</p>
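<p>Equation (2) can be illustrated with the following Python sketch, which counts overlapping adjacent residue pairs and normalizes by the total number of counted pairs. The function name dipeptide_composition and the example sequence are hypothetical; counting overlapping pairs and skipping pairs that contain a non-standard residue are assumptions of this sketch.</p>
<preformat>
```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400 dipeptide fractions P(x_i, x_j) of a protein sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    # Slide a window of width two along the chain (overlapping pairs).
    for first, second in zip(seq, seq[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return [counts[dp] / total for dp in DIPEPTIDES]


# Hypothetical 11-residue example: 10 adjacent pairs, all distinct.
dipep = dipeptide_composition("MKVLAAGIVGL")
```
</preformat>
<p>Because every protein yields exactly 400 values regardless of its length, the vectors can be passed directly to a fixed-input classifier such as an SVM.</p>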
</sec>
<sec>
<title>Pseudo amino acid composition (PseAAC)</title>
<p>In composition based methods, protein sequence order and length information are completely lost, which in turn may affect the prediction accuracy of the model. To include all the details of its sequence order and length, Chou [
<xref ref-type="bibr" rid="B33">33</xref>
] proposed an effective way of representing known proteins as pseudo amino acid compositions (PseAAC) in his seminal study.</p>
<p>In this representation, the protein character sequence is coded by some of its physicochemical properties. Since the amphiphilic properties (hydrophobicity and hydrophilicity) play a very important role in protein folding and function [
<xref ref-type="bibr" rid="B34">34</xref>
,
<xref ref-type="bibr" rid="B35">35</xref>
], these two indices may be used to reflect effectively the sequence order effects.</p>
<p>Accordingly a protein sample (P) of length 'L' is represented in PseAAC form as:</p>
<p>
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M4" name="1471-2105-14-S14-S7-i4" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mrow>
<mml:mtable class="gathered">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo class="MathClass-op">⋮</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo class="MathClass-op">⋮</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where</p>
<p>
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M5" name="1471-2105-14-S14-S7-i5" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>τ</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="2em"></mml:mspace>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle>
<mml:mo>∑</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>λ</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>τ</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="2em"></mml:mspace>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>λ</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>and</p>
<p>
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M6" name="1471-2105-14-S14-S7-i6" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>τ</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>τ</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>τ</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mi>Θ</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>τ</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="2em"></mml:mspace>
<mml:mi>τ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mtext> and </mml:mtext>
<mml:mi>λ</mml:mi>
<mml:mo><</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M7" name="1471-2105-14-S14-S7-i7" overflow="scroll">
<mml:mi>Θ</mml:mi>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mi>H</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mi>H</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where f
<sub>i</sub>
, i = 1, 2, ..., 20 are the normalized occurrence frequencies of the 20 native amino acids in the protein P; the symbol θ
<sub>τ</sub>
represents the τ-tier sequence correlation factor computed using (4), with H(P
<sub>i</sub>
) and H(P
<sub>j</sub>
) representing the hydrophobic and hydrophilic values of the amino acids P
<sub>i</sub>
and P
<sub>j</sub>
, respectively; and the symbol 'w' represents the weight factor, which governs the degree to which the sequence-order effect is incorporated. In the present study, we judiciously chose the weight as 0.1 and λ as 5 for better accuracy. In essence, the first 20 values in (3) represent the classic amino acid composition, while the next 2λ values reflect the amphiphilic sequence correlation along the protein chain.</p>
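<p>The sequence correlation factor can be illustrated with a minimal Python sketch. It assumes a single property scale H supplied as a dictionary and averages the product H(P
<sub>i</sub>
) × H(P
<sub>i+τ</sub>
) over the L − τ residue pairs separated by τ positions. The function name, the three-letter toy scale and the example sequence are hypothetical and are not taken from this study.</p>
<preformat>
```python
from typing import Dict, List


def correlation_factors(seq: str, h: Dict[str, float], lam: int) -> List[float]:
    """Return [theta_1, ..., theta_lam] for one property scale h."""
    length = len(seq)
    if lam >= length:
        raise ValueError("lambda must be smaller than the sequence length L")
    thetas = []
    for tau in range(1, lam + 1):
        # Pair residue i with residue i + tau; there are L - tau such pairs.
        pairs = zip(seq[:length - tau], seq[tau:])
        total = sum(h[a] * h[b] for a, b in pairs)
        thetas.append(total / (length - tau))
    return thetas


# Hypothetical toy scale (not real hydrophobicity values).
toy_scale = {"A": 1.0, "K": -1.0, "G": 0.0}
thetas = correlation_factors("AKAKG", toy_scale, lam=2)
print(thetas)
```
</preformat>
<p>In a full PseAAC vector, one such list would be computed per property scale (hydrophobicity and hydrophilicity), giving the 2λ correlation terms that follow the 20 composition terms.</p>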
</sec>
<sec>
<title>Terminal-based N-Center-C (NCC) amino acid composition</title>
<p>Many proteins in the cell contain important signal peptides at their N- or C-terminal region, which act as markers for the subcellular location of the protein [
<xref ref-type="bibr" rid="B30">30</xref>
]. In this method, the amino acid compositions of the N-terminal region, the C-terminal region, and the remaining center portion of the protein sequence are computed separately and then concatenated to represent a sample protein. The rationale is to provide more feature information to the SVM model, because the percentage composition of a whole sequence may not give adequate weight to the compositional bias known to be present at the protein termini [
<xref ref-type="bibr" rid="B29">29</xref>
]. In this technique, a protein sample is represented as:</p>
<p>
<disp-formula id="bmcM6">
<label>(6)</label>
<mml:math id="M8" name="1471-2105-14-S14-S7-i8" overflow="scroll">
<mml:mi>P</mml:mi>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mrow>
<mml:mo class="MathClass-open">[</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mstyle class="text">
<mml:mtext> </mml:mtext>
</mml:mstyle>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">]</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>The AAC for each segment is computed using (1). Hence, a 60-dimensional feature vector is used to represent a protein. In an empirical study, a terminal segment length of 25 residues was found to be the best compromise in both phase-I and phase-II predictions.</p>
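<p>As a rough illustration of the NCC representation, the sketch below splits a sequence into a 25-residue N-terminal segment, a 25-residue C-terminal segment and the remaining center, and concatenates the three 20-value compositions. It assumes that AAC is expressed as a fraction (equation (1) may instead use percentages); the function names and the example sequence are hypothetical.</p>
<preformat>
```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(segment: str) -> List[float]:
    """Fraction of each of the 20 standard amino acids in a segment."""
    if not segment:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / len(segment) for aa in AMINO_ACIDS]


def ncc_features(seq: str, term_len: int = 25) -> List[float]:
    """Concatenate AAC of the N-terminal, center and C-terminal segments."""
    if 2 * term_len > len(seq):
        raise ValueError("sequence is too short for two terminal segments")
    n_part = seq[:term_len]
    center = seq[term_len:len(seq) - term_len]
    c_part = seq[len(seq) - term_len:]
    return aac(n_part) + aac(center) + aac(c_part)


# Hypothetical 100-residue sequence with a distinct composition per segment.
demo = "M" * 25 + "A" * 50 + "K" * 25
features = ncc_features(demo)
print(len(features))
```
</preformat>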
</sec>
<sec>
<title>Physicochemical property-based composition</title>
<p>The physicochemical properties of amino acids have been successfully used to predict protein function, structure, and subcellular locations [
<xref ref-type="bibr" rid="B41">41</xref>
,
<xref ref-type="bibr" rid="B53">53</xref>
]. In this study, we described each protein by twenty physicochemical features: nineteen residue classes, namely charged residues, hydrophilic (polar) and neutral, basic polar or positively charged, acidic polar or negatively charged, aliphatic, aromatic, small, tiny, large, hydrophobic (non-polar) and aromatic, hydrophobic (non-polar) and neutral, amidic (contains amide group), cyclic, hydroxylic, sulfur-containing, h-bonding, acidic and their amide, ionizable, and forms covalent cross-link (disulfide bond), together with the theoretical pI (isoelectric point). A detailed description of these classes is provided in Table
<xref ref-type="table" rid="T4">4</xref>
. The composition of amino acids in each class is calculated as a feature to represent the protein. Thus, each protein in this method is represented by a 20-dimensional feature vector.</p>
<table-wrap id="T4" position="float">
<label>Table 4</label>
<caption>
<p>Physicochemical properties used to represent a protein for predicting plastids and their types using SVM.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Sr. No.</th>
<th align="left">Physicochemical property</th>
<th align="center">Amino acids</th>
<th align="center">No. of features</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1</td>
<td align="left">Charged residues</td>
<td align="center">D, R, E, K, H</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">Hydrophilic (polar) and neutral</td>
<td align="center">N, Q, S, T, Y</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">Basic polar or Positively charged</td>
<td align="center">H, K, R</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">Acidic polar or Negatively charged</td>
<td align="center">D, E</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">Aliphatic</td>
<td align="center">A, G, I, L, V</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">6</td>
<td align="left">Aromatic</td>
<td align="center">F, W, Y</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">7</td>
<td align="left">Small</td>
<td align="center">T, D, N</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">8</td>
<td align="left">Tiny</td>
<td align="center">G, A, S, P</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">9</td>
<td align="left">Large</td>
<td align="center">F, R, W, Y</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">Hydrophobic (non-polar) and aromatic</td>
<td align="center">W, F</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">11</td>
<td align="left">Hydrophobic (non-polar) and neutral</td>
<td align="center">A, C, G, I, L, M, F, P, W, V</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">12</td>
<td align="left">Amidic (
<italic>contains amide group</italic>
)</td>
<td align="center">N, Q</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">13</td>
<td align="left">Cyclic</td>
<td align="center">P</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">14</td>
<td align="left">Hydroxylic</td>
<td align="center">S, T</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">15</td>
<td align="left">Sulfur-containing</td>
<td align="center">C, M</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">16</td>
<td align="left">H-bonding</td>
<td align="center">C, W, N, Q, S, T, Y, K, R, H, D, E</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">17</td>
<td align="left">Acidic and their Amide</td>
<td align="center">D, E, N, Q</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">18</td>
<td align="left">Ionizable</td>
<td align="center">D, E, H, C, Y, K, R</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">19</td>
<td align="left">Forms covalent cross-link (disulfide bond)</td>
<td align="center">C</td>
<td align="center">1</td>
</tr>
<tr>
<td align="left">20</td>
<td align="left">Theoretical pI (isoelectric point)</td>
<td align="center">-</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
</table-wrap>
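<p>The class-based composition in Table 4 can be sketched as follows. The residue sets are copied from rows 1 to 19 of the table; the dictionary keys, the function name and the example peptide are hypothetical, and composition is assumed to be a percentage of sequence length. The theoretical pI (row 20) is deliberately omitted because the text does not specify how it was computed.</p>
<preformat>
```python
from typing import Dict

# Residue sets from rows 1-19 of Table 4 (key names are illustrative).
PHYSCHEM_CLASSES: Dict[str, str] = {
    "charged": "DREKH",
    "polar_neutral": "NQSTY",
    "basic": "HKR",
    "acidic": "DE",
    "aliphatic": "AGILV",
    "aromatic": "FWY",
    "small": "TDN",
    "tiny": "GASP",
    "large": "FRWY",
    "hydrophobic_aromatic": "WF",
    "hydrophobic_neutral": "ACGILMFPWV",
    "amidic": "NQ",
    "cyclic": "P",
    "hydroxylic": "ST",
    "sulfur_containing": "CM",
    "h_bonding": "CWNQSTYKRHDE",
    "acidic_and_amide": "DENQ",
    "ionizable": "DEHCYKR",
    "disulfide_forming": "C",
}


def class_composition(seq: str) -> Dict[str, float]:
    """Percentage of residues in seq that fall into each class."""
    if not seq:
        raise ValueError("empty sequence")
    return {
        name: 100.0 * sum(seq.count(aa) for aa in members) / len(seq)
        for name, members in PHYSCHEM_CLASSES.items()
    }


comp = class_composition("DDKKSTCP")  # hypothetical 8-residue peptide
print(comp["charged"], comp["hydroxylic"])
```
</preformat>
<p>Because the classes overlap (for example, D is both charged and acidic), the nineteen percentages do not sum to 100.</p>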
</sec>
<sec>
<title>Similarity search-based PSI-BLAST module</title>
<p>In this study, we also performed PSI-BLAST-based predictions, in which a query sequence is searched by similarity against a non-redundant database; the whole of UniProt/Swiss-Prot was used as the target database. Previous studies have suggested that PSI-BLAST can detect remote homologies and is thus preferred over standard BLAST. It carries out an iterative search in which sequences found in one round are used to build a new score model for the next round of searching [
<xref ref-type="bibr" rid="B36">36</xref>
]. Three iterations were carried out at a best cut-off E-value of 0.001. This module was run separately for plastid and non-plastid data and the various plastid-type classes depending upon the similarity of the query protein to the proteins in the dataset. The module would return "
<italic>unknown protein type</italic>
" if no significant similarity is obtained. Accordingly, values for H (number of total hits), C (number of correct hits), P (percent of correct hits), and A (percent accuracy) are calculated to evaluate the PSI-BLAST based prediction performance.</p>
</sec>
<sec>
<title>Support Vector Machine (SVM)</title>
<p>Support Vector Machines are a class of learning machines based on optimization principles from statistical learning theory, originally introduced by Vapnik and co-workers [
<xref ref-type="bibr" rid="B37">37</xref>
,
<xref ref-type="bibr" rid="B38">38</xref>
] about two decades ago. It has been well studied and extensively applied in the areas of pattern recognition, regression and classification problems in various fields of science and engineering, for example: predicting protein subcellular localization [
<xref ref-type="bibr" rid="B19">19</xref>
,
<xref ref-type="bibr" rid="B29">29</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
,
<xref ref-type="bibr" rid="B32">32</xref>
,
<xref ref-type="bibr" rid="B39">39</xref>
-
<xref ref-type="bibr" rid="B42">42</xref>
], classifying microarray data [
<xref ref-type="bibr" rid="B43">43</xref>
], predicting protein secondary structure [
<xref ref-type="bibr" rid="B44">44</xref>
,
<xref ref-type="bibr" rid="B45">45</xref>
], forecasting disease [
<xref ref-type="bibr" rid="B46">46</xref>
], predicting membrane protein type [
<xref ref-type="bibr" rid="B47">47</xref>
] and many other areas. In classification problems, the objective of SVM is to separate the training data with a maximum margin while maintaining reasonable computing efficiency. Multi-class classification is commonly handled by reducing the problem to a series of binary classifications. Popular methods include One-Versus-Rest (OVR), One-Versus-One (OVO), and Directed Acyclic Graph Support Vector Machines (DAGSVM). In this work, we followed the OVO method for the multi-class problem. More details of the theory of SVM have been described elsewhere [
<xref ref-type="bibr" rid="B37">37</xref>
,
<xref ref-type="bibr" rid="B38">38</xref>
].</p>
<p>To develop various classifiers, we have used SVM_light [
<xref ref-type="bibr" rid="B48">48</xref>
], a freely downloadable package of SVM (
<ext-link ext-link-type="uri" xlink:href="http://svmlight.joachims.org/">http://svmlight.joachims.org/</ext-link>
). This software lets the user define a number of parameters and choose among built-in kernel functions, including linear, polynomial, and radial basis function (RBF). In our preliminary study, the RBF kernel performed better than the linear and polynomial kernels (
<italic>data not shown</italic>
). Therefore, we used the RBF kernel in all further analysis and have presented the results accordingly.</p>
<p>
<bold>Training/testing schema: </bold>
In both phases, the models were evaluated by five-fold cross-validation, in which the dataset is divided into five parts. Four parts are combined to form the training set, and the models developed from this set are then tested on the fifth part (the testing set). This process is repeated five times, changing the training/testing sets each time. In addition, we tested the performance of our models on independent test datasets that were not used at any stage of training.</p>
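<p>A minimal sketch of the five-fold partitioning described above is given below. It shuffles sample indices with a fixed seed and lets each of the five parts serve once as the testing set. The function name and seed are hypothetical, the sample count of 5688 is only illustrative (2844 positive plus 2844 negative proteins), and no class stratification is attempted, which the original pipeline may or may not have used.</p>
<preformat>
```python
import random
from typing import List, Tuple


def five_fold_splits(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k parts of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        splits.append((train, test))
    return splits


splits = five_fold_splits(5688)
for train, test in splits:
    print(len(train), len(test))
```
</preformat>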
<p>
<bold>Evaluation parameters</bold>
: The performance of models developed in both the phase-I (single class) and phase-II (multi class) predictions is evaluated based on the following standard parameters:</p>
<p>Sensitivity or coverage of positive examples: It is the percent of plastid proteins correctly predicted,</p>
<p>
<disp-formula id="bmcM7">
<label>(7)</label>
<mml:math id="M9" name="1471-2105-14-S14-S7-i9" overflow="scroll">
<mml:mrow>
<mml:mtext>Sensitivity</mml:mtext>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-bin">×</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Specificity or coverage of negative examples: It is the percentage of non-plastid proteins correctly predicted as non-plastid,</p>
<p>
<disp-formula id="bmcM8">
<label>(8)</label>
<mml:math id="M10" name="1471-2105-14-S14-S7-i10" overflow="scroll">
<mml:mrow>
<mml:mrow>
<mml:mtext>Specificity</mml:mtext>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mi>x</mml:mi>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Accuracy: It is the percentage of correctly predicted proteins (both plastid and non-plastid),</p>
<p>
<disp-formula id="bmcM9">
<label>(9)</label>
<mml:math id="M11" name="1471-2105-14-S14-S7-i11" overflow="scroll">
<mml:mrow>
<mml:mrow>
<mml:mtext>Accuracy</mml:mtext>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mi>x</mml:mi>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Precision: It is the percentage of positive predictions that are correct, calculated as:</p>
<p>
<disp-formula id="bmcM10">
<label>(10)</label>
<mml:math id="M12" name="1471-2105-14-S14-S7-i12" overflow="scroll">
<mml:mstyle class="text">
<mml:mtext>Precision</mml:mtext>
</mml:mstyle>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
<mml:mn>100</mml:mn>
</mml:math>
</disp-formula>
</p>
<p>Rate of False Predictions (RFP): also known as False Discovery Rate (FDR), is the
<italic>expected </italic>
percent of false predictions in the set of predictions,</p>
<p>
<disp-formula id="bmcM11">
<label>(11)</label>
<mml:math id="M13" name="1471-2105-14-S14-S7-i13" overflow="scroll">
<mml:mstyle class="text">
<mml:mtext>Rate of False Prediction</mml:mtext>
</mml:mstyle>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>RFP</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
<mml:mn>100</mml:mn>
</mml:math>
</disp-formula>
</p>
<p>Error Rate: It gives the total percentage of wrong predictions, calculated as:</p>
<p>
<disp-formula id="bmcM12">
<label>(12)</label>
<mml:math id="M14" name="1471-2105-14-S14-S7-i14" overflow="scroll">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>Error Rate</mml:mtext>
</mml:mstyle>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext>ER</mml:mtext>
</mml:mstyle>
</mml:mrow>
</mml:mfenced>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mstyle class="text">
<mml:mtext>x</mml:mtext>
</mml:mstyle>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Matthews correlation coefficient (MCC): considered to be the most robust parameter of any class prediction method. An MCC of 1 indicates a perfect prediction, while an MCC of 0 indicates a completely random prediction.</p>
<p>
<disp-formula id="bmcM13">
<label>(13)</label>
<mml:math id="M15" name="1471-2105-14-S14-S7-i15" overflow="scroll">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-bin">-</mml:mo>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.</p>
<p>In addition, we also plot Receiver Operating Characteristic (ROC) curves and calculate the Area Under Curve (AUC) for each of the classifiers.</p>
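<p>The evaluation parameters in equations (7) to (13) can be computed from the four confusion-matrix counts, as in the sketch below. Percentages are taken over TP + TN + FP + FN where a total is required. The function name and the example counts are hypothetical and do not correspond to any result in this study; the sketch also assumes that both classes and both prediction sets are non-empty.</p>
<preformat>
```python
import math
from typing import Dict


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Evaluation parameters from confusion-matrix counts."""
    total = tp + tn + fp + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "rfp": 100.0 * fp / (tp + fp),
        "error_rate": 100.0 * (fp + fn) / total,
        # MCC is left at 0.0 when its denominator vanishes.
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Hypothetical counts for illustration only.
scores = evaluate(tp=80, tn=90, fp=10, fn=20)
for name, value in scores.items():
    print(name, round(value, 2))
```
</preformat>
<p>Note that precision and RFP always sum to 100%, as do accuracy and error rate, which is a quick consistency check on any reported table.</p>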
</sec>
</sec>
<sec>
<title>Results and discussion</title>
<p>We first describe the homology-based prediction results and then the SVM-based performance for both phases of plastid-type prediction, including testing on independent datasets.</p>
<sec>
<title>(i). Homology-based PSI-BLAST</title>
<p>A biologist would typically first check similarity-based predictions, as is usually done in research labs. We performed PSI-BLAST searches of the 2844 positive-set and 2844 negative-set proteins against the UniProt/Swiss-Prot database. Results in Table
<xref ref-type="table" rid="T5">5</xref>
show that, although the negative-set proteins could be predicted with about 82% accuracy, the positive-set proteins were correctly annotated with only about 50% accuracy: 1443 of the 2844 plastid proteins were correctly predicted. Thus, a significant fraction of the positive set (~50%) could not be predicted using a homology-based approach. In phase-II, the performance of PSI-BLAST was even worse. Only 167 of the 542 chloroplast proteins in the query set could be predicted correctly, an accuracy of about 31% (Table
<xref ref-type="table" rid="T5">5</xref>
). The results for the other plastid types, chromoplast (9.61%), amyloplast (18.10%) and etioplast (1.82%), show that the similarity-based approach fails to characterize the various forms of plastids. Machine learning-based algorithms such as SVMs are thus a good alternative for prediction purposes. We describe the SVM results in detail below, separately for each phase.</p>
<table-wrap id="T5" position="float">
<label>Table 5</label>
<caption>
<p>Overall performance of homology-based (PSI-BLAST) prediction for the identification of plastid vs. non-plastid proteins and the classification of diverse plastid-types.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="center">No. of sequences</th>
<th align="center">H</th>
<th align="center">C</th>
<th align="center">P
<break></break>
(%)</th>
<th align="center">A
<break></break>
(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>Phase-I:</bold>
</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td align="left">
<bold>Plastids</bold>
</td>
<td align="center">2844</td>
<td align="center">2731</td>
<td align="center">1443</td>
<td align="center">52.84</td>
<td align="center">50.74</td>
</tr>
<tr>
<td align="left">
<bold>Non-plastids</bold>
</td>
<td align="center">2844</td>
<td align="center">2726</td>
<td align="center">2337</td>
<td align="center">85.73</td>
<td align="center">82.17</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">
<bold>Phase-II:</bold>
</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td align="left">
<bold>Chloroplast</bold>
</td>
<td align="center">542</td>
<td align="center">483</td>
<td align="center">167</td>
<td align="center">34.58</td>
<td align="center">30.81</td>
</tr>
<tr>
<td align="left">
<bold>Chromoplast</bold>
</td>
<td align="center">177</td>
<td align="center">172</td>
<td align="center">17</td>
<td align="center">9.88</td>
<td align="center">9.61</td>
</tr>
<tr>
<td align="left">
<bold>Etioplast</bold>
</td>
<td align="center">220</td>
<td align="center">204</td>
<td align="center">4</td>
<td align="center">1.96</td>
<td align="center">1.82</td>
</tr>
<tr>
<td align="left">
<bold>Amyloplast</bold>
</td>
<td align="center">232</td>
<td align="center">219</td>
<td align="center">42</td>
<td align="center">19.18</td>
<td align="center">18.10</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>*at e-value = 0.001; H = Number of total hits; C = Number of correct or true hits; P = Percent of correct hits calculated as (C/H*100); A = Percent accuracy calculated as (C/total number of proteins in a particular class * 100).</p>
</table-wrap-foot>
</table-wrap>
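<p>The P and A columns of Table 5 follow directly from the footnote definitions, P = C/H × 100 and A = C/(number of sequences) × 100. The short sketch below recomputes them for two rows of the table; the function name is hypothetical, while the counts are those listed in Table 5.</p>
<preformat>
```python
from typing import Tuple


def hit_statistics(n_sequences: int, hits: int, correct: int) -> Tuple[float, float]:
    """Return (P, A) as defined in the Table 5 footnote, rounded to 2 places."""
    percent_correct = 100.0 * correct / hits          # P = C / H * 100
    percent_accuracy = 100.0 * correct / n_sequences  # A = C / N * 100
    return round(percent_correct, 2), round(percent_accuracy, 2)


plastid = hit_statistics(n_sequences=2844, hits=2731, correct=1443)
chloroplast = hit_statistics(n_sequences=542, hits=483, correct=167)
print(plastid, chloroplast)
```
</preformat>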
</sec>
<sec>
<title>(ii). Phase-I: SVM-based identification of plastid proteins</title>
<p>First, the amino acid frequencies of both plastid and non-plastid proteins were compared. Figure
<xref ref-type="fig" rid="F2">2</xref>
shows a bar graph of these frequencies, which reveals marked variation between the two compositions. The statistical significance of this variation was assessed with p-values from a two-tailed Student's t-test (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Table S1). A summary of the observations as reported in Table S1 and Figure
<xref ref-type="fig" rid="F2">2</xref>
indicates that the composition of 11 amino acids
<italic>viz</italic>
. Ala (A), Cys (C), Ile (I), Met (M), Pro (P), Val (V), Asp (D), His (H), Lys (K), Ser (S), and Trp (W) differs significantly between plastids and non-plastids. Second, to better understand the variation in compositional features, we grouped the amino acids into seven classes based on the chemical and/or structural properties of their side chains
<italic>viz</italic>
. aliphatic, aromatic, acidic, basic, hydroxylic, sulfur-containing, and amidic. We assessed the significance of the differences with the t-test and list the results in Table S2 (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
). The p-values at the 0.05 level of significance show that aromatic, hydroxylic and sulfur-containing amino acids vary significantly between plastids and non-plastids.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>A comparative bar graph of amino acid composition differences in plastid and non-plastid proteins</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S7-2"></graphic>
</fig>
<p>These two tests suggest that composition-based models can be developed to distinguish plastid from non-plastid proteins. In five-fold cross-validation, the simple amino acid composition-based model achieves a sensitivity of 85.37% and a prediction accuracy of 85.51%, with an MCC of 0.71 (Table
<xref ref-type="table" rid="T6">6</xref>
). The precision rate is also more than 85%, which shows that plastid proteins could be predicted with a high positive prediction rate. Many researchers have reported the usefulness of amino acid composition for prediction purposes, e.g. in prediction of subcellular localization [
<xref ref-type="bibr" rid="B49">49</xref>
,
<xref ref-type="bibr" rid="B50">50</xref>
], and it has been shown to carry a signal, almost entirely due to the surface residues, that identifies the subcellular location [
<xref ref-type="bibr" rid="B51">51</xref>
]. Next, we developed a PseAAC classifier. Performance improved, with a sensitivity of 89.45%, an accuracy of 86.20% and a slightly higher MCC (0.73) (Table
<xref ref-type="table" rid="T6">6</xref>
). The PseAAC approach takes into account the composition based on physicochemical properties and also includes the correlation factors associated with the protein chain, thus providing richer, higher-dimensional information to the SVM.</p>
<table-wrap id="T6" position="float">
<label>Table 6</label>
<caption>
<p>Overall performance of various feature classifiers in 5-fold cross-validation for the identification of plastid vs. non-plastid proteins (
<italic>phase-</italic>
<italic>I</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Feature type</th>
<th align="center">Sensitivity
<break></break>
(%)</th>
<th align="center">Specificity
<break></break>
(%)</th>
<th align="center">Accuracy
<break></break>
(%)</th>
<th align="center">MCC</th>
<th align="center">Precision (%)</th>
<th align="center">RFP (%)</th>
<th align="center">SVM kernel type</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>AAC</bold>
</td>
<td align="center">85.37</td>
<td align="center">85.65</td>
<td align="center">85.51</td>
<td align="center">0.71</td>
<td align="center">85.61</td>
<td align="center">14.39</td>
<td align="center">RBF (
<italic>γ </italic>
= 370,
<italic>C </italic>
= 3,
<italic>j </italic>
= 1)</td>
</tr>
<tr>
<td align="left">
<bold>PseAA</bold>
</td>
<td align="center">89.45</td>
<td align="center">82.95</td>
<td align="center">86.20</td>
<td align="center">0.73</td>
<td align="center">83.99</td>
<td align="center">16.01</td>
<td align="center">RBF (
<italic>γ </italic>
= 385,
<italic>C </italic>
= 2,
<italic>j </italic>
= 2)</td>
</tr>
<tr>
<td align="left">
<bold>Dipep</bold>
</td>
<td align="center">88.08</td>
<td align="center">85.51</td>
<td align="center">86.80</td>
<td align="center">0.74</td>
<td align="center">85.88</td>
<td align="center">14.12</td>
<td align="center">RBF (
<italic>γ </italic>
= 265,
<italic>C </italic>
= 6,
<italic>j </italic>
= 1)</td>
</tr>
<tr>
<td align="left">
<bold>NCC</bold>
</td>
<td align="center">84.14</td>
<td align="center">89.66</td>
<td align="center">86.90</td>
<td align="center">0.74</td>
<td align="center">89.06</td>
<td align="center">10.94</td>
<td align="center">RBF (
<italic>γ </italic>
= 20,
<italic>C </italic>
= 3,
<italic>j </italic>
= 2)</td>
</tr>
<tr>
<td align="left">
<bold>Phys-Chem</bold>
</td>
<td align="center">79.57</td>
<td align="center">81.05</td>
<td align="center">80.31</td>
<td align="center">0.61</td>
<td align="center">80.76</td>
<td align="center">19.24</td>
<td align="center">RBF (
<italic>γ </italic>
= 135,
<italic>C </italic>
= 2,
<italic>j </italic>
= 1)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>*best values obtained at ≥ 0.0 threshold, individual performance of these classifiers can be seen in supplementary material; AAC = amino acid composition, PseAA = Pseudo amino acid composition, Dipep = Dipeptide composition, NCC = N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition (
<italic>sequence divided into 3 parts</italic>
), Phys-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, RFP = Rate of False Predictions, RBF = Radial Basis Function of SVM.</p>
</table-wrap-foot>
</table-wrap>
<p>To include more diverse information, we further developed a dipeptide composition-based model. This classifier achieves an MCC of 0.74, the joint-highest of all models, with a slight increase in accuracy (86.80%) and a lower rate of false prediction (14.12%) than the AAC and PseAAC models. Earlier studies have reported that dipeptide composition performs better than simple amino acid composition [
<xref ref-type="bibr" rid="B29">29</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
,
<xref ref-type="bibr" rid="B32">32</xref>
], because it also provides sequence-order information along with the composition. Next, we compared the results of the NCC and physicochemical property-based composition models. The physicochemical model, with an overall sensitivity of 79.57% and an MCC of 0.61, performed comparatively poorly in predicting plastid proteins. The NCC-based classifier achieves an accuracy of 86.90% with an MCC of 0.74, which is on par with the DIPEP model, although its sensitivity is lower. However, it achieves the highest specificity (89.66%) and precision (89.06%), and the lowest RFP (10.94%), of all the models. Thus, for distinguishing plastid from non-plastid proteins, both the DIPEP and NCC classifiers could be used efficiently, as both achieve the best MCC of 0.74 with the highest accuracies (~87%). To check this further, we plotted ROC curves for each of the models, as discussed below. Please note: Table
<xref ref-type="table" rid="T6">6</xref>
shows the overall performance of the prediction modules at an SVM threshold score of 0.0. The individual performance of these classifiers at all threshold values (-1.2 to 1.2) is available in the Supplementary Material (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Tables S3-S7).</p>
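<p>As a minimal sketch of the dipeptide feature discussed above (not the authors' implementation), the composition can be computed as the fraction of each of the 400 ordered residue pairs among the overlapping pairs of a sequence. The function name, the skipping of non-standard residues and the toy sequence are assumptions made for illustration only.</p>
<preformat>
```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order that defines the feature vector.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide fractions for one protein sequence.

    Pairs containing non-standard residues are skipped (an assumption of this sketch).
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    # zip(sequence, sequence[1:]) yields every overlapping pair of adjacent residues.
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[pair] / total for pair in DIPEPTIDES]


features = dipeptide_composition("MAAST")  # hypothetical toy sequence
```
</preformat>
<p>Because each feature counts an ordered pair of adjacent residues, the vector retains some local sequence-order information that plain amino acid composition discards.</p>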
<p>
<bold>ROC curves</bold>
: A receiver operating characteristic (ROC) curve depicts the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR, i.e. 1 - specificity) of a binary classifier as its discrimination threshold is varied. Figure
<xref ref-type="fig" rid="F3">3</xref>
depicts the ROC curves for each of the five classifiers developed. The curves for the DIPEP and NCC models lie closer to the left side of the chart, primarily because these models have very high specificity at all thresholds, which is a desirable characteristic of an ROC curve. We also calculated the AUC values for each model (Figure
<xref ref-type="fig" rid="F3">3</xref>
), which show that the AUC values of 0.79 and 0.80 for the DIPEP and NCC models, respectively, are better than those of the other models. The AUC is the probability that, when one positive and one negative example are drawn at random, the decision function assigns a higher value to the positive example than to the negative one. In phase-I prediction, we therefore judged the DIPEP and NCC models to be the best classifiers for predicting plastid vs. non-plastid proteins.</p>
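<p>The probabilistic reading of the AUC given above can be illustrated with a short sketch that counts correctly ranked positive-negative pairs. The decision values below are invented for illustration and are not taken from the study; counting ties as half a win is a common convention assumed here.</p>
<preformat>
```python
def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical SVM decision values, for illustration only.
pos = [1.1, 0.4, -0.2]
neg = [0.3, -0.6, -0.9]
auc = pairwise_auc(pos, neg)  # 8 of the 9 pairs are ranked correctly
```
</preformat>
<p>A classifier that ranks every positive above every negative gives a value of 1.0, while random scoring gives about 0.5.</p>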
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>ROC curve for all five classifiers (AAC, PseAAC, DIPEP, NCC, PhysicoChem) in phase-I prediction; plastid vs. non-plastid proteins identification</bold>
. AUC = Area Under Curve, AAC = amino acid composition, PseAAC = pseudo amino acid composition, DIPEP = dipeptide composition, NCC = N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition, PhysicoChem = Protein physicochemical properties.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S7-3"></graphic>
</fig>
<p>
<bold>Performance on independent set</bold>
: As mentioned in the methodology section, testing on an independent dataset is another way to judge the overall performance of a classifier, because these sequences are not used in training. Our independent set consists of 316 sequences each in the positive and negative datasets. We ran all five classifiers on these datasets separately. Table
<xref ref-type="table" rid="T7">7</xref>
shows that although the sensitivity values of the AAC (69.30%), PseAAC (68.35%), NCC (65.82%) and physicochemical (68.35%) models are higher than that of DIPEP (60.44%), these models have lower specificity and precision and a higher RFP. In machine learning, a balance between sensitivity and specificity is important when judging the overall performance of a classifier. The DIPEP model has the highest precision (positive prediction rate) of 89.25% and a very high specificity of 92.72%, so its RFP is the lowest (10.75%) of all the classifiers (Table
<xref ref-type="table" rid="T7">7</xref>
). Accordingly, we judge the DIPEP-based model to be the better-performing classifier. The individual performance of these five classifiers on the independent test set at all threshold values (-1.2 to 1.2) is available in the Supplementary Material (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Tables S13-S17).</p>
<table-wrap id="T7" position="float">
<label>Table 7</label>
<caption>
<p>Overall performance of various feature classifiers on an '
<italic>independent test</italic>
' dataset for the identification of plastid vs. non-plastid proteins (
<italic>phase-I</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Feature type</th>
<th align="center">Sensitivity
<break></break>
(%)</th>
<th align="center">Specificity
<break></break>
(%)</th>
<th align="center">Accuracy
<break></break>
(%)</th>
<th align="center">MCC</th>
<th align="center">Precision (%)</th>
<th align="center">RFP (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>AAC</bold>
</td>
<td align="center">69.30</td>
<td align="center">87.03</td>
<td align="center">78.16</td>
<td align="center">0.57</td>
<td align="center">84.23</td>
<td align="center">15.77</td>
</tr>
<tr>
<td align="left">
<bold>PseAA</bold>
</td>
<td align="center">68.35</td>
<td align="center">87.34</td>
<td align="center">77.85</td>
<td align="center">0.57</td>
<td align="center">84.38</td>
<td align="center">15.62</td>
</tr>
<tr>
<td align="left">
<bold>Dipep</bold>
</td>
<td align="center">60.44</td>
<td align="center">92.72</td>
<td align="center">76.58</td>
<td align="center">0.56</td>
<td align="center">89.25</td>
<td align="center">10.75</td>
</tr>
<tr>
<td align="left">
<bold>NCC</bold>
</td>
<td align="center">65.82</td>
<td align="center">87.97</td>
<td align="center">76.90</td>
<td align="center">0.55</td>
<td align="center">84.55</td>
<td align="center">15.45</td>
</tr>
<tr>
<td align="left">
<bold>Phys-Chem</bold>
</td>
<td align="center">68.35</td>
<td align="center">84.49</td>
<td align="center">76.42</td>
<td align="center">0.54</td>
<td align="center">81.51</td>
<td align="center">18.49</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>*individual performance of these classifiers can be seen in supplementary material; AAC = amino acid composition (best values at ≥ 0.0 threshold), PseAA = Pseudo amino acid composition (best values at ≥ 0.1 threshold), Dipep = Dipeptide composition (best values at ≥ 0.2 threshold), NCC = N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition (
<italic>sequence divided into 3 parts</italic>
), Phys-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, RFP = Rate of False Predictions.</p>
</table-wrap-foot>
</table-wrap>
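<p>The measures in Table 7 can be reproduced from confusion-matrix counts, as in the sketch below. The counts are not reported in the text: they are back-calculated here from the Dipep row under the assumption of 316 positive and 316 negative test sequences. RFP is taken as 100 minus precision, which matches the tabulated values but is our reading of the measure.</p>
<preformat>
```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification measures in percent (MCC as a fraction)."""
    precision = 100.0 * tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + fn + tn + fp),
        "precision": precision,
        "rfp": 100.0 - precision,  # assumed definition: false share of positive calls
        "mcc": mcc,
    }


# Counts inferred (not reported) for the Dipep row: 316 positives, 316 negatives.
dipep = binary_metrics(tp=191, fn=125, tn=293, fp=23)
rounded = {name: round(value, 2) for name, value in dipep.items()}
```
</preformat>
<p>With these inferred counts, the rounded values agree with the Dipep row of Table 7 (60.44, 92.72, 76.58, 0.56, 89.25 and 10.75).</p>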
</sec>
<sec>
<title>(iii). Phase-II: SVM-based classification of plastid-type proteins</title>
<p>One of the major goals of the current study was to predict the various plastid types based on their function. Proteins identified as plastid proteins in phase-I are therefore further classified into one of the plastid subclasses using the prediction models developed in phase-II. As in phase-I, we first compared the amino acid compositions of the plastid types under study: chloroplast, chromoplast, etioplast and amyloplast (Figure
<xref ref-type="fig" rid="F4">4</xref>
). We assessed the significance of the differences in amino acid composition using Student's t-test and found statistically significant variation among the plastid types. The p-values of the significance test are listed in Table S1 (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
). Second, as in phase-I, we compared the physicochemical property-based differences among the plastid types by grouping the amino acids into seven classes (Table S2). Based on the t-test, we observed that the aliphatic, aromatic, acidic, basic and hydroxylic amino acids vary significantly among most of the plastid types. These comparisons show that composition differs significantly among the plastid subclasses, which is the basis for the prediction models developed in this study.</p>
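<p>The hand-off from phase-I to phase-II described above can be sketched as follows. This is only an illustration: the two scoring callables are hypothetical stand-ins for trained classifiers, the 0.0 cut-off mirrors the phase-I threshold quoted in the tables, and picking the highest-scoring plastid type in phase-II is an assumption, not a description of the authors' multi-class scheme.</p>
<preformat>
```python
def two_phase_predict(sequence, phase1_score, phase2_scores, threshold=0.0):
    """Return 'non-plastid' or the best-scoring plastid type for one sequence.

    phase1_score: callable giving a plastid vs. non-plastid decision value.
    phase2_scores: callable giving a dict that maps plastid type to decision value.
    Both callables are hypothetical stand-ins for trained classifiers.
    """
    if phase1_score(sequence) >= threshold:
        scores = phase2_scores(sequence)
        return max(scores, key=scores.get)
    return "non-plastid"


# Toy stand-ins so the sketch runs; real models would be trained SVMs.
def toy_phase1(sequence):
    return 0.8 if sequence.startswith("M") else -0.5


def toy_phase2(sequence):
    return {
        "chloroplast": 0.6,
        "chromoplast": -0.1,
        "etioplast": 0.2,
        "amyloplast": -0.4,
    }


label_a = two_phase_predict("MAAST", toy_phase1, toy_phase2)
label_b = two_phase_predict("GAAST", toy_phase1, toy_phase2)
```
</preformat>
<p>Only sequences that pass the phase-I cut-off reach the subclass step, so phase-I errors propagate directly into the phase-II output.</p>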
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>A comparative bar-graph of amino acid composition differences among various plastid-types; amyloplast, chromoplast, chloroplast and etioplast proteins</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S7-4"></graphic>
</fig>
<p>The overall performance of the five multi-class models (AAC, PseAAC, DIPEP, NCC and physicochemical) is shown in Table
<xref ref-type="table" rid="T8">8</xref>
. The simple AAC model achieves a sensitivity of about 60%, an accuracy of 77.45%, a precision of 59% and an MCC of 0.40. Using PseAAC improved the results slightly, predicting plastid subclasses with an overall accuracy of about 78% and an MCC of 0.41. The NCC model shows comparable results, with an overall accuracy of 78.39%, a sensitivity of 60.97% and an MCC of 0.42. The physicochemical model is less accurate, with a sensitivity of 56.74% and an MCC of only 0.36. However, the DIPEP classifier again performs better than the other features, with an overall sensitivity of 62.26%, an accuracy of 78.60% and an MCC of 0.44. Its precision is also the highest, at about 63%. This shows that the DIPEP feature works well in both phases of plastid prediction and can be used for annotation purposes. The performance of each of these five classifiers on each plastid-type category can be found in the Supplementary Material (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Tables S8-S12).</p>
<table-wrap id="T8" position="float">
<label>Table 8</label>
<caption>
<p>Overall performance of various feature classifiers in 5-fold cross-validation for the classification of diverse plastid-types
<sup>* </sup>
(
<italic>phase-II</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Feature type</th>
<th align="center">Sensitivity
<break></break>
(%)</th>
<th align="center">Specificity
<break></break>
(%)</th>
<th align="center">Accuracy
<break></break>
(%)</th>
<th align="center">MCC</th>
<th align="center">Precision (%)</th>
<th align="center">ER (%)</th>
<th align="center">SVM kernel type</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>AAC</bold>
</td>
<td align="center">60.03</td>
<td align="center">76.05</td>
<td align="center">77.45</td>
<td align="center">0.40</td>
<td align="center">59.00</td>
<td align="center">22.55</td>
<td align="center">RBF (
<italic>γ </italic>
= 246,
<italic>C </italic>
= 1,
<italic>j </italic>
= 2)</td>
</tr>
<tr>
<td align="left">
<bold>PseAA</bold>
</td>
<td align="center">60.72</td>
<td align="center">77.13</td>
<td align="center">78.01</td>
<td align="center">0.41</td>
<td align="center">59.55</td>
<td align="center">21.99</td>
<td align="center">RBF (
<italic>γ </italic>
= 225,
<italic>C </italic>
= 1,
<italic>j </italic>
= 2)</td>
</tr>
<tr>
<td align="left">
<bold>Dipep</bold>
</td>
<td align="center">62.26</td>
<td align="center">75.85</td>
<td align="center">78.60</td>
<td align="center">0.44</td>
<td align="center">62.62</td>
<td align="center">21.40</td>
<td align="center">RBF (
<italic>γ </italic>
= 210,
<italic>C </italic>
= 1,
<italic>j </italic>
= 2)</td>
</tr>
<tr>
<td align="left">
<bold>NCC</bold>
</td>
<td align="center">60.97</td>
<td align="center">77.34</td>
<td align="center">78.39</td>
<td align="center">0.42</td>
<td align="center">58.51</td>
<td align="center">21.61</td>
<td align="center">RBF (
<italic>γ </italic>
= 5,
<italic>C </italic>
= 2,
<italic>j </italic>
= 3)</td>
</tr>
<tr>
<td align="left">
<bold>Phys-Chem</bold>
</td>
<td align="center">56.70</td>
<td align="center">78.01</td>
<td align="center">76.56</td>
<td align="center">0.36</td>
<td align="center">54.15</td>
<td align="center">23.44</td>
<td align="center">RBF (
<italic>γ </italic>
= 37,
<italic>C </italic>
= 9,
<italic>j </italic>
= 1)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>*classification of 4 plastid types: chloroplast, chromoplast, etioplast, amyloplast; individual performance of these classifiers on each class can be seen in supplementary material; AAC = amino acid composition, PseAA = Pseudo amino acid composition, Dipep = Dipeptide composition, NCC = N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition (
<italic>sequence divided into 3 parts</italic>
), Phys-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, ER = Error Rate, RBF = Radial Basis Function of SVM.</p>
</table-wrap-foot>
</table-wrap>
<p>It is worth mentioning that prediction performance falls significantly in phase-II compared to phase-I. This might be because all plastid subclasses share common targeting signals (e.g. the transit peptides), as they all belong to the single class 'plastids', which may make their individual patterns very difficult to distinguish by machine learning. However, the overall amino acid composition varies significantly among them (Figure
<xref ref-type="fig" rid="F4">4</xref>
, and Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
: Figure S1), which contributed to the respectable prediction accuracies shown in Table
<xref ref-type="table" rid="T8">8</xref>
. Combined, the results show that the plastid types can be categorized computationally with a satisfactory level of performance, although the models need more refinement, which we plan to carry out as more plastid-type training data are added to the various repositories.</p>
<p>
<bold>ROC curves</bold>
: Figure
<xref ref-type="fig" rid="F5">5</xref>
shows the ROC curves for the four sub-classes of plastids. As the DIPEP-based model shows better performance in five-fold cross-validation, we use this classifier to draw ROC plots. As expected, the 'chloroplast' class shows a better ROC plot compared to other classes. A more precise way of evaluating the performance is to calculate the AUC. The closer the area to 0.5, the poorer the method, and the closer to 1.0, the better the method. The AUC for chloroplast (0.80) is the highest of all, which indicates that the "chloro" type plastids are more easily identifiable than other plastids. The other sub-classes
<italic>viz</italic>
. chromoplast (AUC = 0.59), etioplast (AUC = 0.66) and amyloplast (AUC = 0.65) achieve lower AUC values, though all remain above the chance level of 0.5.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>ROC curves for the best classifier (Dipeptide composition-based) in phase-II prediction, i.e. classification of various plastid types (chloroplast, chromoplast, etioplast, amyloplast)</bold>
. Values in parentheses represent Area Under Curve (AUC).</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S7-5"></graphic>
</fig>
<p>
<bold>Performance on an independent dataset</bold>
: As in phase-I, we also tested the phase-II models on an independent dataset that contains 60 chloroplast sequences, 17 chromoplast, 24 etioplast and 23 amyloplast type proteins. The overall performance of each classifier is depicted in Table
<xref ref-type="table" rid="T9">9</xref>
and the individual performances on each subclass are available in the Supplementary Material (Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Tables S18-S22). As with the 5-fold results, the DIPEP-based model outperformed the other classifiers and achieved an overall sensitivity of 61.29% with an accuracy of about 75%. Its precision (~74%) was also the highest (Table
<xref ref-type="table" rid="T9">9</xref>
). The NCC-based classifier performed almost on par with the DIPEP model, with the same sensitivity and MCC values, although with lower precision (60.42%).</p>
<table-wrap id="T9" position="float">
<label>Table 9</label>
<caption>
<p>Overall performance of various feature classifiers on an '
<italic>independent test</italic>
' dataset for the classification of diverse plastid-types
<sup>* </sup>
(
<italic>phase-II</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Feature type</th>
<th align="center">Sensitivity
<break></break>
(%)</th>
<th align="center">Specificity
<break></break>
(%)</th>
<th align="center">Accuracy
<break></break>
(%)</th>
<th align="center">MCC</th>
<th align="center">Precision (%)</th>
<th align="center">ER
<break></break>
(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>AAC</bold>
</td>
<td align="center">57.26</td>
<td align="center">63.89</td>
<td align="center">72.54</td>
<td align="center">0.30</td>
<td align="center">62.45</td>
<td align="center">27.47</td>
</tr>
<tr>
<td align="left">
<bold>PseAA</bold>
</td>
<td align="center">57.26</td>
<td align="center">63.88</td>
<td align="center">72.48</td>
<td align="center">0.31</td>
<td align="center">65.25</td>
<td align="center">27.52</td>
</tr>
<tr>
<td align="left">
<bold>Dipep</bold>
</td>
<td align="center">61.29</td>
<td align="center">65.96</td>
<td align="center">74.97</td>
<td align="center">0.40</td>
<td align="center">73.97</td>
<td align="center">25.03</td>
</tr>
<tr>
<td align="left">
<bold>NCC</bold>
</td>
<td align="center">61.29</td>
<td align="center">75.82</td>
<td align="center">77.15</td>
<td align="center">0.40</td>
<td align="center">60.42</td>
<td align="center">22.85</td>
</tr>
<tr>
<td align="left">
<bold>Physico-Chem</bold>
</td>
<td align="center">45.97</td>
<td align="center">65.30</td>
<td align="center">66.63</td>
<td align="center">0.14</td>
<td align="center">47.03</td>
<td align="center">33.37</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>*classification of 4 plastid types: chloroplast, chromoplast, etioplast, amyloplast; individual performance of these classifiers on each class can be seen in supplementary material; AAC = amino acid composition, PseAA = Pseudo amino acid composition, Dipep = Dipeptide composition, NCC = N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition (
<italic>sequence divided into 3 parts</italic>
), Physico-Chem = Protein physicochemical properties, MCC = Matthews Correlation Coefficient, ER = Error Rate.</p>
</table-wrap-foot>
</table-wrap>
<p>Overall, the above results suggest that plastid proteins can be categorized into various plastid types using machine learning approaches with moderate to high accuracy, whereas the similarity-based module showed very low performance in this study. Although we achieved high prediction performance in phase-I in distinguishing plastid from non-plastid proteins, the performance of the phase-II models was more modest. As this is a first attempt to develop prediction models for plastid types based on their function, we consider the accuracy achieved satisfactory. One possible reason for the lower success level is that very few training sequences are available for classes such as chromoplast, etioplast and amyloplast, and almost none for other subtypes. Although experimental proteomics approaches have generated a considerable amount of data, more training data are needed to develop highly accurate and efficient prediction models. A second possible reason is that sequence differences among plastid types may be very small, making it challenging for machine learning modules to distinguish among them. We achieved about 79% prediction accuracy in phase-II, with an MCC of 0.44 and a precision of 63%, which shows that plastid types can be classified through machine learning. As datasets grow, and by applying novel algorithmic approaches, we will refine these models and make them available on the PLpred web server.</p>
</sec>
<sec>
<title>(iv). Comparison with existing plastid localization predictors</title>
<p>Although there are no existing tools to predict plastid subtypes, some web tools are available for predicting plastid-localized proteins from primary sequence information. We compared the performance of our phase-I models in distinguishing plastid from non-plastid proteins with two widely used tools, TargetP [
<xref ref-type="bibr" rid="B52">52</xref>
] and WoLF PSORT [
<xref ref-type="bibr" rid="B53">53</xref>
] along with two other recently developed predictors: YLoc-HighRes [
<xref ref-type="bibr" rid="B54">54</xref>
] and iLoc-Plant [
<xref ref-type="bibr" rid="B55">55</xref>
]. The performance of these methods was compared using the same independent dataset containing 316 plastid and 316 non-plastid proteins (Table
<xref ref-type="table" rid="T10">10</xref>
). As both DIPEP and NCC models from our phase-I achieved almost the same results, we used both these models for comparison; results are presented separately. Results in Table
<xref ref-type="table" rid="T10">10</xref>
show that our method achieves a higher prediction accuracy of about 77% with an MCC of 0.56 compared to the other tools. The MCC achieved by the other four tools is between 0.32 and 0.44, with overall prediction accuracies around 66%, about 11 percentage points lower than our method. Among the existing tools, TargetP and WoLF PSORT identify more of the plastid proteins than YLoc and iLoc-Plant, as shown by their higher sensitivity. Our method outperforms all the other tools on sensitivity, accuracy and MCC, although iLoc-Plant and YLoc achieve higher specificity and precision. Thus, PLpred can be used as an efficient tool for predicting plastid proteins.</p>
<table-wrap id="T10" position="float">
<label>Table 10</label>
<caption>
<p>Overall performance comparison of our method with the existing web tools for predicting plastid proteins.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Tools</th>
<th align="center">Sensitivity
<break></break>
(%)</th>
<th align="center">Specificity
<break></break>
(%)</th>
<th align="center">Accuracy
<break></break>
(%)</th>
<th align="center">MCC</th>
<th align="center">Precision (%)</th>
<th align="center">RFP (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>WoLF PSORT</bold>
</td>
<td align="center">56.96</td>
<td align="center">74.76</td>
<td align="center">65.82</td>
<td align="center">0.3223</td>
<td align="center">69.50</td>
<td align="center">30.50</td>
</tr>
<tr>
<td align="left">
<bold>TargetP</bold>
</td>
<td align="center">55.70</td>
<td align="center">85.89</td>
<td align="center">65.97</td>
<td align="center">0.3998</td>
<td align="center">88.44</td>
<td align="center">11.56</td>
</tr>
<tr>
<td align="left">
<bold>iLoc-Plant</bold>
</td>
<td align="center">36.39</td>
<td align="center">98.42</td>
<td align="center">67.41</td>
<td align="center">0.4438</td>
<td align="center">95.83</td>
<td align="center">4.17</td>
</tr>
<tr>
<td align="left">
<bold>YLoc </bold>
(HighRes)</td>
<td align="center">34.81</td>
<td align="center">97.47</td>
<td align="center">66.14</td>
<td align="center">0.4142</td>
<td align="center">93.22</td>
<td align="center">6.78</td>
</tr>
<tr>
<td align="left">
<bold>PLpred </bold>
(DIPEP)</td>
<td align="center">60.44</td>
<td align="center">92.72</td>
<td align="center">76.58</td>
<td align="center">0.56</td>
<td align="center">89.25</td>
<td align="center">10.75</td>
</tr>
<tr>
<td align="left">
<bold>PLpred </bold>
(NCC)</td>
<td align="center">65.82</td>
<td align="center">87.97</td>
<td align="center">76.90</td>
<td align="center">0.55</td>
<td align="center">84.55</td>
<td align="center">15.45</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Performance comparison done on an '
<italic>independent dataset</italic>
' that contains 316 plastid and 316 non-plastid proteins. MCC = Matthews Correlation Coefficient, RFP = Rate of False Predictions, DIPEP = Dipeptide composition-based classifier, NCC = N
<sub>terminal</sub>
-Center-C
<sub>terminal </sub>
composition-based classifier.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusion</title>
<p>Plastids, found in plants and algae, are the major site of manufacture and storage of important chemical compounds used by the cell. In plants, they differentiate into various forms depending on the function they perform in the cell, such as the chloroplast, chromoplast, etioplast, amyloplast
<italic>etc</italic>
. Recent proteomics approaches have generated an adequate amount of protein data in each of these sub classes. However, large-scale plastid proteomics has become difficult and is nearing saturation due to several constraints as discussed. On the other hand, with the emphasis on genome sequencing and more and more data being generated rapidly, there is a need for accurate computational systems that could be used for genome-wide annotation of various plant genomes. To date, there is no prediction system that can be used to categorize plastid proteins into their various functional types. The current work is an attempt in that direction where we explore homology-based as well as machine learning approaches to classify plastid protein types.</p>
<p>The similarity-based approach showed very weak performance, indicating the need for machine learning algorithms. Our benchmark tests on diverse training and testing data showed that it is possible to develop prediction models that distinguish various plastid types from their sequences alone. Our SVM-based method works in two phases: it first identifies a query protein as plastid or non-plastid with high accuracy, and then classifies the identified plastid sequences into one of the four plastid subclasses under study. Although we will further refine the phase-II models as more data become available, the current method should be applicable to the annotation of various available proteomes.</p>
</sec>
<sec>
<title>List of abbreviations</title>
<p>SVM: Support Vector Machine; AAC: Amino acid composition; PseAAC: Pseudo amino acid composition; DIPEP: Dipeptide composition; MCC: Matthews correlation coefficient; RBF: Radial Basis Function; TP: True positive; TN: True negative; FP: False positive; FN: False negative.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing financial interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>RK conceived the study, collected the data, developed algorithms, participated in its design and coordination and wrote the final manuscript. SSS and RV helped in model development, performed the calculations, figures and tables, and helped in drafting the original manuscript. TW helped in data analysis and setting up the prediction tool developed from this study. All authors read and approved the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>Supplementary material; tables</p>
</caption>
<media xlink:href="1471-2105-14-S14-S7-S1.docx">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional file 2</title>
<p>Figure S1</p>
</caption>
<media xlink:href="1471-2105-14-S14-S7-S2.docx">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>The authors acknowledge the support to this study from faculty start-up funds to RK from NIMFFAB/Department of Biochemistry & Molecular Biology, and support to TW/RK from OSU's Provost Office interdisciplinary grant (#12), the
<italic>i</italic>
CREST Center for Bioinformatics and Computational Biology (
<ext-link ext-link-type="uri" xlink:href="http://icrest.okstate.edu/">http://icrest.okstate.edu/</ext-link>
). Support to RV from USDA-NIFA grant number 2010-85605-20542 is duly acknowledged. We also thank Dr. Ulrich Melcher for reading a draft of the manuscript. The authors thank the anonymous referees for their help in improving the article.</p>
</sec>
<sec>
<title>Declaration</title>
<p>Funding for the publication of this article has come from start-up funds account AA-1-51220, OSU.</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics </italic>
Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14">http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Kleffmann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>von Zychlinski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Russenberger</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hirsch-Hoffmann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gehrig</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<article-title>Proteome dynamics during plastid differentiation in rice</article-title>
<source>Plant physiology</source>
<year>2007</year>
<volume>14</volume>
<issue>2</issue>
<fpage>912</fpage>
<lpage>923</lpage>
<pub-id pub-id-type="pmid">17189339</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Cui</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Veeraraghavan</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Richter</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wall</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Jansen</surname>
<given-names>RK</given-names>
</name>
<name>
<surname>Leebens-Mack</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Makalowska</surname>
<given-names>I</given-names>
</name>
<name>
<surname>dePamphilis</surname>
<given-names>CW</given-names>
</name>
<article-title>ChloroplastDB: the Chloroplast Genome Database</article-title>
<source>Nucleic acids research</source>
<year>2006</year>
<volume>14</volume>
<issue>Database</issue>
<fpage>D692</fpage>
<lpage>696</lpage>
<pub-id pub-id-type="pmid">16381961</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Gewolb</surname>
<given-names>J</given-names>
</name>
<article-title>Bioengineering: plant scientists see big potential in tiny plastids</article-title>
<source>Science</source>
<year>2002</year>
<volume>295</volume>
<fpage>258</fpage>
<lpage>259</lpage>
<pub-id pub-id-type="doi">10.1126/science.295.5553.258</pub-id>
<pub-id pub-id-type="pmid">11786623</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Grossmann</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<article-title>Proteome analysis of chloroplast mRNA processing and degradation</article-title>
<source>Journal of proteome research</source>
<year>2007</year>
<volume>14</volume>
<issue>2</issue>
<fpage>809</fpage>
<lpage>820</lpage>
<pub-id pub-id-type="doi">10.1021/pr060473q</pub-id>
<pub-id pub-id-type="pmid">17269737</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Siddique</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Grossmann</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<article-title>Proteome analysis of bell pepper (Capsicum annuum L.) chromoplasts</article-title>
<source>Plant & cell physiology</source>
<year>2006</year>
<volume>14</volume>
<issue>12</issue>
<fpage>1663</fpage>
<lpage>1673</lpage>
<pub-id pub-id-type="doi">10.1093/pcp/pcl033</pub-id>
<pub-id pub-id-type="pmid">17098784</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Balmer</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Vensel</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Manieri</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Schurmann</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hurkman</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Buchanan</surname>
<given-names>BB</given-names>
</name>
<article-title>A complete ferredoxin/thioredoxin system regulates fundamental processes in amyloplasts</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>2006</year>
<volume>103</volume>
<fpage>2988</fpage>
<lpage>2993</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.0511040103</pub-id>
<pub-id pub-id-type="pmid">16481623</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Andon</surname>
<given-names>NL</given-names>
</name>
<name>
<surname>Hollingworth</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Koller</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Greenland</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Yates</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Haynes</surname>
<given-names>PA</given-names>
</name>
<article-title>Proteomic characterization of wheat amyloplasts using identification of proteins by tandem mass spectrometry</article-title>
<source>Proteomics</source>
<year>2002</year>
<volume>2</volume>
<issue>9</issue>
<fpage>1156</fpage>
<lpage>1168</lpage>
<pub-id pub-id-type="doi">10.1002/1615-9861(200209)2:9&lt;1156::AID-PROT1156&gt;3.0.CO;2-4</pub-id>
<pub-id pub-id-type="pmid">12362334</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Zeng</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>X</given-names>
</name>
<article-title>A proteomic analysis of the chromoplasts isolated from sweet orange fruits [Citrus sinensis (L.) Osbeck]</article-title>
<source>Journal of Experimental Botany</source>
<year>2011</year>
<volume>62</volume>
<issue>15</issue>
<fpage>5297</fpage>
<lpage>5309</lpage>
<pub-id pub-id-type="doi">10.1093/jxb/err140</pub-id>
<pub-id pub-id-type="pmid">21841170</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Balmer</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Vensel</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>DuPont</surname>
<given-names>FM</given-names>
</name>
<name>
<surname>Buchanan</surname>
<given-names>BB</given-names>
</name>
<name>
<surname>Hurkman</surname>
<given-names>WJ</given-names>
</name>
<article-title>Proteome of amyloplasts isolated from developing wheat endosperm presents evidence of broad metabolic capability</article-title>
<source>Journal of Experimental Botany</source>
<year>2006</year>
<volume>57</volume>
<issue>7</issue>
<fpage>1591</fpage>
<lpage>1602</lpage>
<pub-id pub-id-type="doi">10.1093/jxb/erj156</pub-id>
<pub-id pub-id-type="pmid">16595579</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Dupont</surname>
<given-names>FM</given-names>
</name>
<article-title>Metabolic pathways of the wheat (Triticum aestivum) endosperm amyloplast revealed by proteomics</article-title>
<source>BMC Plant Biology</source>
<year>2008</year>
<volume>8</volume>
<fpage>39</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2229-8-39</pub-id>
<pub-id pub-id-type="pmid">18419817</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Barsan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Sanchez-Bel</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rombaldi</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Egea</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Rossignol</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kuntz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zouine</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Latche</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bouzayen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pech</surname>
<given-names>JC</given-names>
</name>
<article-title>Characteristics of the tomato chromoplast revealed by proteomic analysis</article-title>
<source>Journal of Experimental Botany</source>
<year>2010</year>
<volume>61</volume>
<fpage>2413</fpage>
<lpage>2431</lpage>
<pub-id pub-id-type="doi">10.1093/jxb/erq070</pub-id>
<pub-id pub-id-type="pmid">20363867</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kleffmann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>von Zychlinski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<article-title>Analysis of shotgun proteomics and RNA profiling data from
<italic>Arabidopsis thaliana </italic>
chloroplasts</article-title>
<source>J Proteome Res</source>
<year>2005</year>
<volume>4</volume>
<fpage>637</fpage>
<lpage>640</lpage>
<pub-id pub-id-type="doi">10.1021/pr049764u</pub-id>
<pub-id pub-id-type="pmid">15822946</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Kleffmann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hirsch-Hoffmann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<article-title>plprot: a comprehensive proteome database for different plastid types</article-title>
<source>Plant Cell Physiol</source>
<year>2006</year>
<volume>47</volume>
<fpage>432</fpage>
<lpage>436</lpage>
<pub-id pub-id-type="doi">10.1093/pcp/pcj005</pub-id>
<pub-id pub-id-type="pmid">16418230</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Peltier</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Zabrouskov</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Giacomelli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Rudella</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ytterberg</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Rutschow</surname>
<given-names>H</given-names>
</name>
<name>
<surname>van Wijk</surname>
<given-names>KJ</given-names>
</name>
<article-title>The oligomeric stromal proteome of
<italic>Arabidopsis thaliana </italic>
chloroplasts</article-title>
<source>Mol Cell Proteomics</source>
<year>2006</year>
<volume>5</volume>
<fpage>114</fpage>
<lpage>133</lpage>
<pub-id pub-id-type="pmid">16207701</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Sun</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Zybailov</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Majeran</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Friso</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Olinares</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>van Wijk</surname>
<given-names>KJ</given-names>
</name>
<article-title>PPDB, the Plant Proteomics Database at Cornell</article-title>
<source>Nucleic Acids Research</source>
<year>2009</year>
<volume>37</volume>
<issue>Database</issue>
<fpage>D969</fpage>
<lpage>974</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn654</pub-id>
<pub-id pub-id-type="pmid">18832363</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Emanuelsson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Brunak</surname>
<given-names>S</given-names>
</name>
<name>
<surname>von Heijne</surname>
<given-names>G</given-names>
</name>
<article-title>Predicting subcellular localization of proteins based on their N-terminal amino acid sequence</article-title>
<source>J Mol Biol</source>
<year>2000</year>
<volume>300</volume>
<fpage>1005</fpage>
<lpage>1016</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.2000.3903</pub-id>
<pub-id pub-id-type="pmid">10891285</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Kleffmann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Russenberger</surname>
<given-names>D</given-names>
</name>
<name>
<surname>von Zychlinski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Christopher</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Sjolander</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<article-title>The
<italic>Arabidopsis thaliana </italic>
chloroplast proteome reveals pathway abundance and novel protein functions</article-title>
<source>Current Biology</source>
<year>2004</year>
<volume>14</volume>
<fpage>354</fpage>
<lpage>362</lpage>
<pub-id pub-id-type="doi">10.1016/j.cub.2004.02.039</pub-id>
<pub-id pub-id-type="pmid">15028209</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Richly</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Leister</surname>
<given-names>D</given-names>
</name>
<article-title>An improved prediction of chloroplast proteins reveals diversities and commonalities in the chloroplast proteomes of Arabidopsis and rice</article-title>
<source>Gene</source>
<year>2004</year>
<volume>329</volume>
<fpage>11</fpage>
<lpage>16</lpage>
<pub-id pub-id-type="pmid">15033524</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Nair</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rost</surname>
<given-names>B</given-names>
</name>
<article-title>Mimicking cellular sorting improves prediction of subcellular localization</article-title>
<source>J Mol Biol</source>
<year>2005</year>
<volume>348</volume>
<fpage>85</fpage>
<lpage>100</lpage>
<pub-id pub-id-type="doi">10.1016/j.jmb.2005.02.025</pub-id>
<pub-id pub-id-type="pmid">15808855</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Jarvis</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Robinson</surname>
<given-names>C</given-names>
</name>
<article-title>Mechanisms of protein import and routing in chloroplasts</article-title>
<source>Current Biology</source>
<year>2004</year>
<volume>14</volume>
<fpage>R1064</fpage>
<lpage>R1077</lpage>
<pub-id pub-id-type="doi">10.1016/j.cub.2004.11.049</pub-id>
<pub-id pub-id-type="pmid">15620643</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>von Zychlinski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kleffmann</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Krishnamurthy</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sjölander</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Baginsky</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gruissem</surname>
<given-names>W</given-names>
</name>
<article-title>Proteome analysis of the rice etioplast: metabolic and regulatory networks and novel protein functions</article-title>
<source>Mol Cell Proteomics</source>
<year>2005</year>
<volume>4</volume>
<issue>8</issue>
<fpage>1072</fpage>
<lpage>1084</lpage>
<pub-id pub-id-type="doi">10.1074/mcp.M500018-MCP200</pub-id>
<pub-id pub-id-type="pmid">15901827</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<name>
<surname>Dondoshansky</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Wolf</surname>
<given-names>Y</given-names>
</name>
<article-title>BLASTCLUST - BLAST score-based single-linkage clustering</article-title>
<year>2000</year>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>HB</given-names>
</name>
<article-title>Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers</article-title>
<source>Journal of Proteome Research</source>
<year>2006</year>
<volume>5</volume>
<fpage>1888</fpage>
<lpage>1897</lpage>
<pub-id pub-id-type="doi">10.1021/pr060167c</pub-id>
<pub-id pub-id-type="pmid">16889410</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>HB</given-names>
</name>
<article-title>Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization</article-title>
<source>Biochem Biophys Res Commun</source>
<year>2006</year>
<volume>347</volume>
<fpage>150</fpage>
<lpage>157</lpage>
<pub-id pub-id-type="doi">10.1016/j.bbrc.2006.06.059</pub-id>
<pub-id pub-id-type="pmid">16808903</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Briesemeister</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Blum</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Brady</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lam</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kohlbacher</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Shatkay</surname>
<given-names>H</given-names>
</name>
<article-title>SherLoc2: A High-Accuracy Hybrid Method for Predicting Subcellular Localization of Proteins</article-title>
<source>Journal of Proteome Research</source>
<year>2009</year>
<volume>8</volume>
<fpage>5363</fpage>
<lpage>5366</lpage>
<pub-id pub-id-type="doi">10.1021/pr900665y</pub-id>
<pub-id pub-id-type="pmid">19764776</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Yu</surname>
<given-names>CS</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>YC</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>CH</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>JK</given-names>
</name>
<article-title>Prediction of protein subcellular localization</article-title>
<source>Proteins: Structure, Function, and Bioinformatics</source>
<year>2006</year>
<volume>64</volume>
<issue>3</issue>
<fpage>643</fpage>
<lpage>651</lpage>
<pub-id pub-id-type="doi">10.1002/prot.21018</pub-id>
<pub-id pub-id-type="pmid">16752418</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Su</surname>
<given-names>EC</given-names>
</name>
<name>
<surname>Chiu</surname>
<given-names>HS</given-names>
</name>
<name>
<surname>Lo</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>JK</given-names>
</name>
<name>
<surname>Sung</surname>
<given-names>TY</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>WL</given-names>
</name>
<article-title>Protein subcellular localization prediction based on compartment-specific features and structure conservation</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>330</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-330</pub-id>
<pub-id pub-id-type="pmid">17825110</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Casadio</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Martelli</surname>
<given-names>PL</given-names>
</name>
<name>
<surname>Pierleoni</surname>
<given-names>A</given-names>
</name>
<article-title>The prediction of protein subcellular localization from sequence: a shortcut to functional genome annotation</article-title>
<source>Briefings in Functional Genomics</source>
<year>2008</year>
<volume>7</volume>
<issue>1</issue>
<fpage>63</fpage>
<lpage>73</lpage>
<pub-id pub-id-type="doi">10.1093/bfgp/eln003</pub-id>
<pub-id pub-id-type="pmid">18283051</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<name>
<surname>Kaundal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Saini</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>PX</given-names>
</name>
<article-title>Combining Machine Learning and Homology-Based Approaches to Accurately Predict Subcellular Localization in Arabidopsis</article-title>
<source>Plant Physiology</source>
<year>2010</year>
<volume>154</volume>
<fpage>36</fpage>
<lpage>54</lpage>
<pub-id pub-id-type="doi">10.1104/pp.110.156851</pub-id>
<pub-id pub-id-type="pmid">20647376</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Kaundal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GPS</given-names>
</name>
<article-title>RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information</article-title>
<source>Proteomics</source>
<year>2009</year>
<volume>9</volume>
<issue>9</issue>
<fpage>2324</fpage>
<lpage>2342</lpage>
<pub-id pub-id-type="doi">10.1002/pmic.200700597</pub-id>
<pub-id pub-id-type="pmid">19402042</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Sahu</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Panda</surname>
<given-names>G</given-names>
</name>
<article-title>A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction</article-title>
<source>Computational Biology and Chemistry</source>
<year>2010</year>
<volume>34</volume>
<fpage>320</fpage>
<lpage>327</lpage>
<pub-id pub-id-type="doi">10.1016/j.compbiolchem.2010.09.002</pub-id>
<pub-id pub-id-type="pmid">21106461</pub-id>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Garg</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bhasin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GPS</given-names>
</name>
<article-title>Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search</article-title>
<source>Journal of Biological Chemistry</source>
<year>2005</year>
<volume>280</volume>
<fpage>14427</fpage>
<lpage>14432</lpage>
<pub-id pub-id-type="doi">10.1074/jbc.M411789200</pub-id>
<pub-id pub-id-type="pmid">15647269</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<article-title>Prediction of protein cellular attributes using pseudo amino acid composition</article-title>
<source>Proteins</source>
<year>2001</year>
<volume>43</volume>
<fpage>246</fpage>
<lpage>255</lpage>
<pub-id pub-id-type="doi">10.1002/prot.1035</pub-id>
<pub-id pub-id-type="pmid">11288174</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Jiang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>Q</given-names>
</name>
<article-title>Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy</article-title>
<source>Protein Peptide Lett</source>
<year>2008</year>
<volume>15</volume>
<fpage>392</fpage>
<lpage>396</lpage>
<pub-id pub-id-type="pmid">18473953</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>YS</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<article-title>Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern</article-title>
<source>J Theor Biol</source>
<year>2008</year>
<volume>250</volume>
<fpage>186</fpage>
<lpage>193</lpage>
<pub-id pub-id-type="doi">10.1016/j.jtbi.2007.09.014</pub-id>
<pub-id pub-id-type="pmid">17959199</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs</article-title>
<source>Nucleic Acids Res</source>
<year>1997</year>
<volume>25</volume>
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
<pub-id pub-id-type="pmid">9254694</pub-id>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="journal">
<name>
<surname>Cortes</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Vapnik</surname>
<given-names>V</given-names>
</name>
<article-title>Support vector networks</article-title>
<source>Machine Learning</source>
<year>1995</year>
<volume>20</volume>
<fpage>273</fpage>
<lpage>297</lpage>
</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="book">
<name>
<surname>Vapnik</surname>
<given-names>V</given-names>
</name>
<source>The Nature of Statistical Learning Theory</source>
<year>1995</year>
<publisher-name>Springer, New York</publisher-name>
</mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="journal">
<name>
<surname>Hua</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Z</given-names>
</name>
<article-title>Support vector machine approach for protein subcellular localization prediction</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<fpage>721</fpage>
<lpage>728</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.8.721</pub-id>
<pub-id pub-id-type="pmid">11524373</pub-id>
</mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="journal">
<name>
<surname>Park</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Kanehisa</surname>
<given-names>M</given-names>
</name>
<article-title>Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1656</fpage>
<lpage>1663</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg222</pub-id>
<pub-id pub-id-type="pmid">12967962</pub-id>
</mixed-citation>
</ref>
<ref id="B41">
<mixed-citation publication-type="journal">
<name>
<surname>Bhasin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GPS</given-names>
</name>
<article-title>ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST</article-title>
<source>Nucleic Acids Research</source>
<year>2004</year>
<volume>32</volume>
<fpage>414</fpage>
<lpage>419</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh350</pub-id>
<pub-id pub-id-type="pmid">15215421</pub-id>
</mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="journal">
<name>
<surname>Xie</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>H</given-names>
</name>
<article-title>LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST</article-title>
<source>Nucleic Acids Research</source>
<year>2005</year>
<volume>33</volume>
<fpage>105</fpage>
<lpage>110</lpage>
<pub-id pub-id-type="pmid">15980436</pub-id>
</mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="journal">
<name>
<surname>Brown</surname>
<given-names>MPS</given-names>
</name>
<name>
<surname>Grundy</surname>
<given-names>WN</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cristianini</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sugnet</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Furey</surname>
<given-names>TS</given-names>
</name>
<name>
<surname>Ares</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<article-title>Knowledge-based analysis of microarray gene expression data by using support vector machines</article-title>
<source>Proc Natl Acad Sci</source>
<year>2000</year>
<volume>97</volume>
<fpage>262</fpage>
<lpage>267</lpage>
<pub-id pub-id-type="doi">10.1073/pnas.97.1.262</pub-id>
<pub-id pub-id-type="pmid">10618406</pub-id>
</mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="journal">
<name>
<surname>Ward</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>McGuffin</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Buxton</surname>
<given-names>BF</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>DT</given-names>
</name>
<article-title>Secondary structure prediction with support vector machines</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1650</fpage>
<lpage>1655</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg223</pub-id>
<pub-id pub-id-type="pmid">12967961</pub-id>
</mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="journal">
<name>
<surname>Ding</surname>
<given-names>CHQ</given-names>
</name>
<name>
<surname>Dubchak</surname>
<given-names>I</given-names>
</name>
<article-title>Multi-class protein fold recognition using support vector machines and neural networks</article-title>
<source>Bioinformatics</source>
<year>2001</year>
<volume>17</volume>
<issue>4</issue>
<fpage>349</fpage>
<lpage>358</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/17.4.349</pub-id>
<pub-id pub-id-type="pmid">11301304</pub-id>
</mixed-citation>
</ref>
<ref id="B46">
<mixed-citation publication-type="journal">
<name>
<surname>Kaundal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kapoor</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GPS</given-names>
</name>
<article-title>Machine learning techniques in disease forecasting: a case study on rice blast prediction</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>485</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-7-485</pub-id>
<pub-id pub-id-type="pmid">17083731</pub-id>
</mixed-citation>
</ref>
<ref id="B47">
<mixed-citation publication-type="journal">
<name>
<surname>Cai</surname>
<given-names>YD</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<article-title>Support vector machines for predicting membrane protein types by using functional domain composition</article-title>
<source>Biophys J</source>
<year>2003</year>
<volume>84</volume>
<fpage>3257</fpage>
<lpage>3263</lpage>
<pub-id pub-id-type="doi">10.1016/S0006-3495(03)70050-2</pub-id>
<pub-id pub-id-type="pmid">12719255</pub-id>
</mixed-citation>
</ref>
<ref id="B48">
<mixed-citation publication-type="book">
<name>
<surname>Joachims</surname>
<given-names>T</given-names>
</name>
<person-group person-group-type="editor">Schölkopf B, Burges C, Smola A</person-group>
<source>Advances in Kernel Methods - Support Vector Learning</source>
<year>1999</year>
<publisher-name>MIT-Press, Massachusetts</publisher-name>
<fpage>41</fpage>
<lpage>56</lpage>
</mixed-citation>
</ref>
<ref id="B49">
<mixed-citation publication-type="journal">
<name>
<surname>Cedano</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Aloy</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Perez-Pons</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Querol</surname>
<given-names>E</given-names>
</name>
<article-title>Relation Between Amino Acid Composition and Cellular Location of Proteins</article-title>
<source>Journal of Molecular Biology</source>
<year>1997</year>
<volume>266</volume>
<fpage>594</fpage>
<lpage>600</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.1996.0804</pub-id>
<pub-id pub-id-type="pmid">9067612</pub-id>
</mixed-citation>
</ref>
<ref id="B50">
<mixed-citation publication-type="journal">
<name>
<surname>Benedito</surname>
<given-names>VA</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Wandrey</surname>
<given-names>M</given-names>
</name>
<name>
<surname>He</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kaundal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Torres-Jerez</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>SK</given-names>
</name>
<name>
<surname>Harrison</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Udvardi</surname>
<given-names>M</given-names>
</name>
<etal></etal>
<article-title>Genomic inventory and transcriptional analysis of
<italic>Medicago truncatula </italic>
transporters</article-title>
<source>Plant Physiology</source>
<year>2010</year>
<volume>152</volume>
<issue>3</issue>
<fpage>1716</fpage>
<lpage>1730</lpage>
<pub-id pub-id-type="doi">10.1104/pp.109.148684</pub-id>
<pub-id pub-id-type="pmid">20023147</pub-id>
</mixed-citation>
</ref>
<ref id="B51">
<mixed-citation publication-type="journal">
<name>
<surname>Andrade</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>O'Donoghue</surname>
<given-names>SI</given-names>
</name>
<name>
<surname>Rost</surname>
<given-names>B</given-names>
</name>
<article-title>Adaptation of Protein Surfaces to Subcellular Location</article-title>
<source>Journal of Molecular Biology</source>
<year>1998</year>
<volume>276</volume>
<fpage>517</fpage>
<lpage>525</lpage>
<pub-id pub-id-type="doi">10.1006/jmbi.1997.1498</pub-id>
<pub-id pub-id-type="pmid">9512720</pub-id>
</mixed-citation>
</ref>
<ref id="B52">
<mixed-citation publication-type="journal">
<name>
<surname>Emanuelsson</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Brunak</surname>
<given-names>S</given-names>
</name>
<name>
<surname>von Heijne</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>H</given-names>
</name>
<article-title>Locating proteins in the cell using TargetP, SignalP and related tools</article-title>
<source>Nature Protocols</source>
<year>2007</year>
<volume>2</volume>
<fpage>953</fpage>
<lpage>971</lpage>
<pub-id pub-id-type="doi">10.1038/nprot.2007.131</pub-id>
<pub-id pub-id-type="pmid">17446895</pub-id>
</mixed-citation>
</ref>
<ref id="B53">
<mixed-citation publication-type="journal">
<name>
<surname>Horton</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Obayashi</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Fujita</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Harada</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Adams-Collier</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Nakai</surname>
<given-names>K</given-names>
</name>
<article-title>WoLF PSORT: protein localization predictor</article-title>
<source>Nucleic Acids Research</source>
<year>2007</year>
<volume>35</volume>
<fpage>W585</fpage>
<lpage>W587</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkm259</pub-id>
<pub-id pub-id-type="pmid">17517783</pub-id>
</mixed-citation>
</ref>
<ref id="B54">
<mixed-citation publication-type="journal">
<name>
<surname>Briesemeister</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rahnenführer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kohlbacher</surname>
<given-names>O</given-names>
</name>
<article-title>YLoc - an interpretable web server for predicting subcellular localization</article-title>
<source>Nucleic Acids Research</source>
<year>2010</year>
<volume>38</volume>
<fpage>W497</fpage>
<lpage>W502</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkq477</pub-id>
<pub-id pub-id-type="pmid">20507917</pub-id>
</mixed-citation>
</ref>
<ref id="B55">
<mixed-citation publication-type="journal">
<name>
<surname>Wu</surname>
<given-names>ZC</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<article-title>iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites</article-title>
<source>Molecular Biosystems</source>
<year>2011</year>
<volume>7</volume>
<fpage>3287</fpage>
<lpage>3297</lpage>
<pub-id pub-id-type="doi">10.1039/c1mb05232b</pub-id>
<pub-id pub-id-type="pmid">21984117</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Bois/explor/OrangerV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A890 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000A890 | SxmlIndent | more
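
Once a record has been extracted with one of the commands above, its reference list can be read with ordinary XML tooling. The following is a minimal sketch, not part of Dilib: it assumes the record is well-formed XML (this page flags a probable XML problem, so a real record may first need repair, for example unescaped `&` characters) and uses only Python's standard `xml.etree.ElementTree`. The function name `summarize_refs` and the `SAMPLE` string are illustrative; the sample reproduces entry B54 from the reference list, trimmed to the fields the sketch reads.

```python
import xml.etree.ElementTree as ET

# Illustrative input: entry B54 from the record, reduced to the fields used below.
SAMPLE = """<ref-list>
<ref id="B54">
<mixed-citation publication-type="journal">
<name><surname>Briesemeister</surname><given-names>S</given-names></name>
<name><surname>Rahnenführer</surname><given-names>J</given-names></name>
<name><surname>Kohlbacher</surname><given-names>O</given-names></name>
<source>Nucleic Acids Research</source>
<year>2010</year>
<pub-id pub-id-type="doi">10.1093/nar/gkq477</pub-id>
</mixed-citation>
</ref>
</ref-list>"""


def summarize_refs(xml_text):
    """Return one dict per <ref>: id, author list, year and DOI (None if absent)."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    rows = []
    for ref in root.iter("ref"):
        # Build "Surname Initials" strings in document order.
        authors = [
            f"{n.findtext('surname', '')} {n.findtext('given-names', '')}".strip()
            for n in ref.iter("name")
        ]
        # Keep the DOI only; other <pub-id> types (e.g. pmid) are ignored.
        doi = None
        for pub_id in ref.iter("pub-id"):
            if pub_id.get("pub-id-type") == "doi":
                doi = pub_id.text
        rows.append({
            "id": ref.get("id"),
            "authors": authors,
            "year": ref.findtext(".//year"),
            "doi": doi,
        })
    return rows


if __name__ == "__main__":
    for row in summarize_refs(SAMPLE):
        print(row["id"], "; ".join(row["authors"]), row["year"], row["doi"])
```

In practice the output of `HfdSelect ... | SxmlIndent` could be redirected to a file and its contents passed to `summarize_refs` instead of `SAMPLE`.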

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Bois
   |area=    OrangerV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.25.
Data generation: Sat Dec 3 17:11:04 2016. Site generation: Wed Mar 6 18:18:32 2024