MERS Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences

Internal identifier: 001069 (Pmc/Corpus); previous: 001068; next: 001070

Authors: Jianzhao Gao; Eshel Faraggi; Yaoqi Zhou; Jishou Ruan; Lukasz Kurgan

Source:

RBID : PMC:3384636

Abstract

Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of B-cell epitopes requires further research. We review the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some methods that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom-designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors, including ABCPred, the method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction of 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which are significantly and slightly better, respectively, than those of the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.
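
The sliding-window idea in the abstract can be illustrated with a minimal sketch. It is not the BEST implementation: the scoring function `toy_window_score` is a hypothetical stand-in for the trained SVM, and averaging over every 20-mer that covers a residue is an assumption, since the abstract says only that "selected" scores are averaged. The names `per_residue_profile` and `toy_window_score` are illustrative.

```python
from typing import Callable, List

WINDOW = 20  # 20-mer window length, as stated in the abstract


def toy_window_score(fragment: str) -> float:
    """Hypothetical stand-in for the SVM output: fraction of hydrophilic residues."""
    hydrophilic = set("DEKRNQSTH")
    return sum(res in hydrophilic for res in fragment) / len(fragment)


def per_residue_profile(
    sequence: str,
    score_fn: Callable[[str], float] = toy_window_score,
    window: int = WINDOW,
) -> List[float]:
    """Average, for each residue, the scores of all windows that cover it.

    Assumption: every covering window contributes equally; the selection
    rule used by the published method is not given in the abstract.
    """
    n = len(sequence)
    if n < window:
        raise ValueError("sequence must be at least one window long")
    totals = [0.0] * n
    counts = [0] * n
    for start in range(n - window + 1):
        score = score_fn(sequence[start:start + window])
        for pos in range(start, start + window):
            totals[pos] += score
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    profile = per_residue_profile("K" * 20 + "A" * 20)
    print(len(profile), profile[0], profile[-1])
```

Residues near the chain termini are covered by fewer windows, so their averages rest on fewer scores; a real predictor would need an explicit rule for these positions.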


Url:
DOI: 10.1371/journal.pone.0040104
PubMed: 22761950
PubMed Central: 3384636

Links to Exploration step

PMC:3384636

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences</title>
<author>
<name sortKey="Gao, Jianzhao" sort="Gao, Jianzhao" uniqKey="Gao J" first="Jianzhao" last="Gao">Jianzhao Gao</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Faraggi, Eshel" sort="Faraggi, Eshel" uniqKey="Faraggi E" first="Eshel" last="Faraggi">Eshel Faraggi</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhou, Yaoqi" sort="Zhou, Yaoqi" uniqKey="Zhou Y" first="Yaoqi" last="Zhou">Yaoqi Zhou</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ruan, Jishou" sort="Ruan, Jishou" uniqKey="Ruan J" first="Jishou" last="Ruan">Jishou Ruan</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff4">
<addr-line>State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, People's Republic of China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kurgan, Lukasz" sort="Kurgan, Lukasz" uniqKey="Kurgan L" first="Lukasz" last="Kurgan">Lukasz Kurgan</name>
<affiliation>
<nlm:aff id="aff5">
<addr-line>Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22761950</idno>
<idno type="pmc">3384636</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3384636</idno>
<idno type="RBID">PMC:3384636</idno>
<idno type="doi">10.1371/journal.pone.0040104</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">001069</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001069</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences</title>
<author>
<name sortKey="Gao, Jianzhao" sort="Gao, Jianzhao" uniqKey="Gao J" first="Jianzhao" last="Gao">Jianzhao Gao</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Faraggi, Eshel" sort="Faraggi, Eshel" uniqKey="Faraggi E" first="Eshel" last="Faraggi">Eshel Faraggi</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Zhou, Yaoqi" sort="Zhou, Yaoqi" uniqKey="Zhou Y" first="Yaoqi" last="Zhou">Yaoqi Zhou</name>
<affiliation>
<nlm:aff id="aff2">
<addr-line>School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff3">
<addr-line>Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ruan, Jishou" sort="Ruan, Jishou" uniqKey="Ruan J" first="Jishou" last="Ruan">Jishou Ruan</name>
<affiliation>
<nlm:aff id="aff1">
<addr-line>School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China</addr-line>
</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="aff4">
<addr-line>State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, People's Republic of China</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kurgan, Lukasz" sort="Kurgan, Lukasz" uniqKey="Kurgan L" first="Lukasz" last="Kurgan">Lukasz Kurgan</name>
<affiliation>
<nlm:aff id="aff5">
<addr-line>Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of B-cell epitopes requires further research. We review the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some methods that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom-designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors, including ABCPred, the method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction of 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which are significantly and slightly better, respectively, than those of the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, P" uniqKey="Chen P">P Chen</name>
</author>
<author>
<name sortKey="Rayner, S" uniqKey="Rayner S">S Rayner</name>
</author>
<author>
<name sortKey="Hu, Kh" uniqKey="Hu K">KH Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beck, A" uniqKey="Beck A">A Beck</name>
</author>
<author>
<name sortKey="Klinguer Hamour, C" uniqKey="Klinguer Hamour C">C Klinguer-Hamour</name>
</author>
<author>
<name sortKey="Bussat, Mc" uniqKey="Bussat M">MC Bussat</name>
</author>
<author>
<name sortKey="Champion, T" uniqKey="Champion T">T Champion</name>
</author>
<author>
<name sortKey="Haeuw, Jf" uniqKey="Haeuw J">JF Haeuw</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X Yang</name>
</author>
<author>
<name sortKey="Yu, X" uniqKey="Yu X">X Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tong, Jc" uniqKey="Tong J">JC Tong</name>
</author>
<author>
<name sortKey="Tan, Tw" uniqKey="Tan T">TW Tan</name>
</author>
<author>
<name sortKey="Ranganathan, S" uniqKey="Ranganathan S">S Ranganathan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blythe, Mj" uniqKey="Blythe M">MJ Blythe</name>
</author>
<author>
<name sortKey="Flower, Dr" uniqKey="Flower D">DR Flower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pellequer, Jl" uniqKey="Pellequer J">JL Pellequer</name>
</author>
<author>
<name sortKey="Westhof, E" uniqKey="Westhof E">E Westhof</name>
</author>
<author>
<name sortKey="Van Regenmortel, Mh" uniqKey="Van Regenmortel M">MH Van Regenmortel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Flower, Dr" uniqKey="Flower D">DR Flower</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hopp, Tp" uniqKey="Hopp T">TP Hopp</name>
</author>
<author>
<name sortKey="Woods, Kr" uniqKey="Woods K">KR Woods</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Welling, Gw" uniqKey="Welling G">GW Welling</name>
</author>
<author>
<name sortKey="Weijer, Wj" uniqKey="Weijer W">WJ Weijer</name>
</author>
<author>
<name sortKey="Van Der Zee, R" uniqKey="Van Der Zee R">R van der Zee</name>
</author>
<author>
<name sortKey="Welling Wester, S" uniqKey="Welling Wester S">S Welling-Wester</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karplus, Pa" uniqKey="Karplus P">PA Karplus</name>
</author>
<author>
<name sortKey="Schulz, Ge" uniqKey="Schulz G">GE Schulz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parker, Jm" uniqKey="Parker J">JM Parker</name>
</author>
<author>
<name sortKey="Guo, D" uniqKey="Guo D">D Guo</name>
</author>
<author>
<name sortKey="Hodges, Rs" uniqKey="Hodges R">RS Hodges</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kolaskar, As" uniqKey="Kolaskar A">AS Kolaskar</name>
</author>
<author>
<name sortKey="Tongaonkar, Pc" uniqKey="Tongaonkar P">PC Tongaonkar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pellequer, Jl" uniqKey="Pellequer J">JL Pellequer</name>
</author>
<author>
<name sortKey="Westhof, E" uniqKey="Westhof E">E Westhof</name>
</author>
<author>
<name sortKey="Van Regenmortel, Mh" uniqKey="Van Regenmortel M">MH Van Regenmortel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pellequer, Jl" uniqKey="Pellequer J">JL Pellequer</name>
</author>
<author>
<name sortKey="Westhof, E" uniqKey="Westhof E">E Westhof</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alix, Aj" uniqKey="Alix A">AJ Alix</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Odorico, M" uniqKey="Odorico M">M Odorico</name>
</author>
<author>
<name sortKey="Pellequer, Jl" uniqKey="Pellequer J">JL Pellequer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Larsen, Je" uniqKey="Larsen J">JE Larsen</name>
</author>
<author>
<name sortKey="Lund, O" uniqKey="Lund O">O Lund</name>
</author>
<author>
<name sortKey="Nielsen, M" uniqKey="Nielsen M">M Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sollner, J" uniqKey="Sollner J">J Söllner</name>
</author>
<author>
<name sortKey="Mayer, B" uniqKey="Mayer B">B Mayer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saha, S" uniqKey="Saha S">S Saha</name>
</author>
<author>
<name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
<author>
<name sortKey="Yang, J" uniqKey="Yang J">J Yang</name>
</author>
<author>
<name sortKey="Chou, Kc" uniqKey="Chou K">KC Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="El Manzalawy, Y" uniqKey="El Manzalawy Y">Y El-Manzalawy</name>
</author>
<author>
<name sortKey="Dobbs, D" uniqKey="Dobbs D">D Dobbs</name>
</author>
<author>
<name sortKey="Honavar, V" uniqKey="Honavar V">V Honavar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sweredoski, Mj" uniqKey="Sweredoski M">MJ Sweredoski</name>
</author>
<author>
<name sortKey="Baldi, P" uniqKey="Baldi P">P Baldi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wee, Lj" uniqKey="Wee L">LJ Wee</name>
</author>
<author>
<name sortKey="Simarmata, D" uniqKey="Simarmata D">D Simarmata</name>
</author>
<author>
<name sortKey="Kam, Yw" uniqKey="Kam Y">YW Kam</name>
</author>
<author>
<name sortKey="Ng, Lf" uniqKey="Ng L">LF Ng</name>
</author>
<author>
<name sortKey="Tong, Jc" uniqKey="Tong J">JC Tong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, Sf" uniqKey="Altschul S">SF Altschul</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Sch Ffer, Aa" uniqKey="Sch Ffer A">AA Schäffer</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ansari, Hr" uniqKey="Ansari H">HR Ansari</name>
</author>
<author>
<name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kulkarni Kale, U" uniqKey="Kulkarni Kale U">U Kulkarni-Kale</name>
</author>
<author>
<name sortKey="Bhosle, S" uniqKey="Bhosle S">S Bhosle</name>
</author>
<author>
<name sortKey="Kolaskar, As" uniqKey="Kolaskar A">AS Kolaskar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haste Andersen, P" uniqKey="Haste Andersen P">P Haste Andersen</name>
</author>
<author>
<name sortKey="Nielsen, M" uniqKey="Nielsen M">M Nielsen</name>
</author>
<author>
<name sortKey="Lund, O" uniqKey="Lund O">O Lund</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, J" uniqKey="Sun J">J Sun</name>
</author>
<author>
<name sortKey="Wu, D" uniqKey="Wu D">D Wu</name>
</author>
<author>
<name sortKey="Xu, T" uniqKey="Xu T">T Xu</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Xu, X" uniqKey="Xu X">X Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sweredoski, Mj" uniqKey="Sweredoski M">MJ Sweredoski</name>
</author>
<author>
<name sortKey="Baldi, P" uniqKey="Baldi P">P Baldi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liang, S" uniqKey="Liang S">S Liang</name>
</author>
<author>
<name sortKey="Zheng, D" uniqKey="Zheng D">D Zheng</name>
</author>
<author>
<name sortKey="Standley, Dm" uniqKey="Standley D">DM Standley</name>
</author>
<author>
<name sortKey="Yao, B" uniqKey="Yao B">B Yao</name>
</author>
<author>
<name sortKey="Zacharias, M" uniqKey="Zacharias M">M Zacharias</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, W" uniqKey="Zhang W">W Zhang</name>
</author>
<author>
<name sortKey="Xiong, Y" uniqKey="Xiong Y">Y Xiong</name>
</author>
<author>
<name sortKey="Zhao, M" uniqKey="Zhao M">M Zhao</name>
</author>
<author>
<name sortKey="Zou, H" uniqKey="Zou H">H Zou</name>
</author>
<author>
<name sortKey="Ye, X" uniqKey="Ye X">X Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, R" uniqKey="Liu R">R Liu</name>
</author>
<author>
<name sortKey="Hu, J" uniqKey="Hu J">J Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, L" uniqKey="Zhao L">L Zhao</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rubinstein, Nd" uniqKey="Rubinstein N">ND Rubinstein</name>
</author>
<author>
<name sortKey="Mayrose, I" uniqKey="Mayrose I">I Mayrose</name>
</author>
<author>
<name sortKey="Martz, E" uniqKey="Martz E">E Martz</name>
</author>
<author>
<name sortKey="Pupko, T" uniqKey="Pupko T">T Pupko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rubinstein, Nd" uniqKey="Rubinstein N">ND Rubinstein</name>
</author>
<author>
<name sortKey="Mayrose, I" uniqKey="Mayrose I">I Mayrose</name>
</author>
<author>
<name sortKey="Pupko, T" uniqKey="Pupko T">T Pupko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Faraggi, E" uniqKey="Faraggi E">E Faraggi</name>
</author>
<author>
<name sortKey="Xue, B" uniqKey="Xue B">B Xue</name>
</author>
<author>
<name sortKey="Zhou, Y" uniqKey="Zhou Y">Y Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dor, O" uniqKey="Dor O">O Dor</name>
</author>
<author>
<name sortKey="Zhou, Y" uniqKey="Zhou Y">Y Zhou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saha, S" uniqKey="Saha S">S Saha</name>
</author>
<author>
<name sortKey="Bhasin, M" uniqKey="Bhasin M">M Bhasin</name>
</author>
<author>
<name sortKey="Raghava, Gp" uniqKey="Raghava G">GP Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, J" uniqKey="Huang J">J Huang</name>
</author>
<author>
<name sortKey="Honda, W" uniqKey="Honda W">W Honda</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ahmad, S" uniqKey="Ahmad S">S Ahmad</name>
</author>
<author>
<name sortKey="Gromiha, Mm" uniqKey="Gromiha M">MM Gromiha</name>
</author>
<author>
<name sortKey="Sarai, A" uniqKey="Sarai A">A Sarai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ahmad, S" uniqKey="Ahmad S">S Ahmad</name>
</author>
<author>
<name sortKey="Gromiha, Mm" uniqKey="Gromiha M">MM Gromiha</name>
</author>
<author>
<name sortKey="Sarai, A" uniqKey="Sarai A">A Sarai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, K" uniqKey="Wang K">K Wang</name>
</author>
<author>
<name sortKey="Samudrala, R" uniqKey="Samudrala R">R Samudrala</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mizianty, Mj" uniqKey="Mizianty M">MJ Mizianty</name>
</author>
<author>
<name sortKey="Kurgan, L" uniqKey="Kurgan L">L Kurgan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Mizianty, Mj" uniqKey="Mizianty M">MJ Mizianty</name>
</author>
<author>
<name sortKey="Kurgan, L" uniqKey="Kurgan L">L Kurgan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mizianty, Mj" uniqKey="Mizianty M">MJ Mizianty</name>
</author>
<author>
<name sortKey="Stach, W" uniqKey="Stach W">W Stach</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Kedarisetti, Kd" uniqKey="Kedarisetti K">KD Kedarisetti</name>
</author>
<author>
<name sortKey="Disfani, Fm" uniqKey="Disfani F">FM Disfani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mizianty, Mj" uniqKey="Mizianty M">MJ Mizianty</name>
</author>
<author>
<name sortKey="Kurgan, L" uniqKey="Kurgan L">L Kurgan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Mizianty, Mj" uniqKey="Mizianty M">MJ Mizianty</name>
</author>
<author>
<name sortKey="Kurgan, L" uniqKey="Kurgan L">L Kurgan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, T" uniqKey="Zhang T">T Zhang</name>
</author>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Shen, S" uniqKey="Shen S">S Shen</name>
</author>
<author>
<name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurgan, L" uniqKey="Kurgan L">L Kurgan</name>
</author>
<author>
<name sortKey="Miri Disfani, F" uniqKey="Miri Disfani F">F Miri Disfani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mohan, A" uniqKey="Mohan A">A Mohan</name>
</author>
<author>
<name sortKey="Oldfield, Cj" uniqKey="Oldfield C">CJ Oldfield</name>
</author>
<author>
<name sortKey="Radivojac, P" uniqKey="Radivojac P">P Radivojac</name>
</author>
<author>
<name sortKey="Vacic, V" uniqKey="Vacic V">V Vacic</name>
</author>
<author>
<name sortKey="Cortese, Ms" uniqKey="Cortese M">MS Cortese</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meszaros, B" uniqKey="Meszaros B">B Mészáros</name>
</author>
<author>
<name sortKey="Simon, I" uniqKey="Simon I">I Simon</name>
</author>
<author>
<name sortKey="Dosztanyi, Z" uniqKey="Dosztanyi Z">Z Dosztányi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miri Disfani, F" uniqKey="Miri Disfani F">F Miri Disfani</name>
</author>
<author>
<name sortKey="Hsu, W L" uniqKey="Hsu W">W-L Hsu</name>
</author>
<author>
<name sortKey="Mizianty, Mj" uniqKey="Mizianty M">MJ Mizianty</name>
</author>
<author>
<name sortKey="Oldfield, Cj" uniqKey="Oldfield C">CJ Oldfield</name>
</author>
<author>
<name sortKey="Xue, B" uniqKey="Xue B">B Xue</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22761950</article-id>
<article-id pub-id-type="pmc">3384636</article-id>
<article-id pub-id-type="publisher-id">PONE-D-12-05194</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0040104</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Biology</subject>
<subj-group>
<subject>Computational Biology</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Immunology</subject>
<subj-group>
<subject>Immune Cells</subject>
<subj-group>
<subject>Antigen-Presenting Cells</subject>
<subject>B Cells</subject>
<subject>T Cells</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Antigen Processing and Recognition</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Molecular Cell Biology</subject>
<subj-group>
<subject>Cellular Types</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Computer Science</subject>
<subj-group>
<subject>Computer Modeling</subject>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Physics</subject>
<subj-group>
<subject>Biophysics</subject>
<subj-group>
<subject>Biomacromolecule-Ligand Interactions</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences</article-title>
<alt-title alt-title-type="running-head">BEST: Sequence-Based Prediction of B-Cell Epitopes</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Gao</surname>
<given-names>Jianzhao</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Faraggi</surname>
<given-names>Eshel</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhou</surname>
<given-names>Yaoqi</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ruan</surname>
<given-names>Jishou</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kurgan</surname>
<given-names>Lukasz</given-names>
</name>
<xref ref-type="aff" rid="aff5">
<sup>5</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America</addr-line>
</aff>
<aff id="aff4">
<label>4</label>
<addr-line>State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin, People's Republic of China</addr-line>
</aff>
<aff id="aff5">
<label>5</label>
<addr-line>Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Uversky</surname>
<given-names>Vladimir N.</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">University of South Florida College of Medicine, United States of America</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>gaojz@nankai.edu.cn</email>
(JG);
<email>lkurgan@ece.ualberta.ca</email>
(LK)</corresp>
<fn fn-type="con">
<p>Conceived and designed the experiments: JG YZ LK. Performed the experiments: JG. Analyzed the data: JG EF YZ JR LK. Contributed reagents/materials/analysis tools: JG EF LK. Wrote the paper: JG EF YZ JR LK.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>27</day>
<month>6</month>
<year>2012</year>
</pub-date>
<volume>7</volume>
<issue>6</issue>
<elocation-id>e40104</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>2</month>
<year>2012</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>5</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Gao et al.</copyright-statement>
<copyright-year>2012</copyright-year>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.</license-p>
</license>
</permissions>
<abstract>
<p>Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of B-cell epitopes requires further research. We review the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some methods that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom-designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors, including ABCPred, the method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction of 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which are significantly and slightly better, respectively, than those of the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.</p>
</abstract>
<counts>
<page-count count="14"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Identification of immunogenic regions/segments in a given antigen protein chain finds important applications in immunotherapies
<xref rid="pone.0040104-Chen1" ref-type="bibr">[1]</xref>
,
<xref rid="pone.0040104-Beck1" ref-type="bibr">[2]</xref>
. Experimental search for these regions is labor- and resource-intensive and would benefit from guidance offered by computational methods that accurately identify these segments. Although such accurate methods are already in place for the prediction of T-cell epitopes
<xref rid="pone.0040104-Yang1" ref-type="bibr">[3]</xref>
,
<xref rid="pone.0040104-Tong1" ref-type="bibr">[4]</xref>
, further research is needed to develop accurate predictors of the B-cell epitopes
<xref rid="pone.0040104-Yang1" ref-type="bibr">[3]</xref>
,
<xref rid="pone.0040104-Blythe1" ref-type="bibr">[5]</xref>
. The B-cell epitopes are categorized into continuous (linear) and discontinuous (conformational). The majority of B-cell epitopes are conformational
<xref rid="pone.0040104-Pellequer1" ref-type="bibr">[6]</xref>
; however, the computational approaches concentrate mostly on the prediction of the “easier” linear epitopes
<xref rid="pone.0040104-Yang1" ref-type="bibr">[3]</xref>
,
<xref rid="pone.0040104-Flower1" ref-type="bibr">[7]</xref>
.</p>
<p>The first attempts to predict the antigenic determinants concerning linear B-cell epitopes from protein chains date back to the 1980s
<xref rid="pone.0040104-Hopp1" ref-type="bibr">[8]</xref>
–
<xref rid="pone.0040104-Kolaskar1" ref-type="bibr">[12]</xref>
. These methods were relatively simple, monoparametric (based on a single propensity such as hydrophilicity), and were limited to small protein datasets. In the 1990s, researchers investigated the usefulness of multiple propensities including hydrophilicity, solvent accessibility, flexibility, and secondary structure propensities, for the B-cell epitope prediction
<xref rid="pone.0040104-Pellequer1" ref-type="bibr">[6]</xref>
,
<xref rid="pone.0040104-Pellequer2" ref-type="bibr">[13]</xref>
–
<xref rid="pone.0040104-Alix1" ref-type="bibr">[15]</xref>
. Results generated in these works were used to develop the BEPITOPE method
<xref rid="pone.0040104-Odorico1" ref-type="bibr">[16]</xref>
, which combines multiple propensities. The predictive quality of single propensity-based methods was critically evaluated by Blythe and Flower
<xref rid="pone.0040104-Blythe1" ref-type="bibr">[5]</xref>
, which motivated further development in this area. The last decade observed an influx of new methods that use more advanced models for the prediction of the linear epitopes. The BepiPred method
<xref rid="pone.0040104-Larsen1" ref-type="bibr">[17]</xref>
applies a hidden Markov model that takes two propensity scores as its inputs. A number of machine learning-based models were recently developed, from decision trees and k-nearest neighbor classifiers that utilize a combination of multiple propensities and sequence complexity as inputs
<xref rid="pone.0040104-Sllner1" ref-type="bibr">[18]</xref>
, to neural network-based ABCPred
<xref rid="pone.0040104-Saha1" ref-type="bibr">[19]</xref>
that performs predictions directly from the protein chain. The latter method is designed to recognize epitopic peptides with 20 or fewer (i.e., 10, 12, 14, 16, and 20) amino acids (AAs). The newest sequence-based predictors of continuous B-cell epitopes exclusively use support vector machine (SVM) models. They include: (1) a method by Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
that predicts 20-mer peptides using a new AA pair-based antigenicity scale
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
; (2) BCPred
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
that predicts 12-, 14-, 16-, 18-, 20-, and 22-mer epitopes directly from sequence using a new type of string kernel-based SVM; (3) COBEpro
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
which utilizes a two-stage design with an SVM that takes novel sequence similarity scores as inputs to predict variable-size peptides in the first stage and a second stage that combines these fragments to predict epitopes in full chains; and (4) the BayesB method
<xref rid="pone.0040104-Wee1" ref-type="bibr">[23]</xref>
that predicts epitopes of diverse lengths (from 12- to 20-mers) using a position-specific scoring matrix (PSSM) generated with PSI-BLAST
<xref rid="pone.0040104-Altschul1" ref-type="bibr">[24]</xref>
. We note that COBEpro was extended to predict conformational epitopes via its second stage. Moreover, one sequence-based method, CBTOPE
<xref rid="pone.0040104-Ansari1" ref-type="bibr">[25]</xref>
, was proposed for the prediction of conformational epitopes. This is also an SVM-based predictor that utilizes multiple propensities and sequence-derived inputs including composition and collocation of AAs.</p>
<fig id="pone-0040104-g001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Overall design of the proposed BEST method.</title>
</caption>
<graphic xlink:href="pone.0040104.g001"></graphic>
</fig>
<p>There are also a few predictors that use protein structure as their input and which predict the conformational epitopes. Early structure-based methods use relatively simple scoring-based approaches. They include CEP
<xref rid="pone.0040104-KulkarniKale1" ref-type="bibr">[26]</xref>
that is based on scoring surface AAs using their solvent accessibility, DiscoTope
<xref rid="pone.0040104-HasteAndersen1" ref-type="bibr">[27]</xref>
, which uses surface/solvent accessibility, contact numbers, and AA propensity scores, and SEPPA
<xref rid="pone.0040104-Sun1" ref-type="bibr">[28]</xref>
that combines a new propensity score with information about solvent accessibility and the packing density of AAs. More recent methods use machine learning models to perform predictions. These include PEPITO
<xref rid="pone.0040104-Sweredoski2" ref-type="bibr">[29]</xref>
that applies linear regression to AA propensity scores and solvent accessibility quantified using half sphere exposure; EPSVR
<xref rid="pone.0040104-Liang1" ref-type="bibr">[30]</xref>
that uses Support Vector Regression and several inputs including epitope propensity scores, contact numbers, secondary structure composition, conservation, side chain energy surface and planarity scores; a method by Zhang et al.
<xref rid="pone.0040104-Zhang1" ref-type="bibr">[31]</xref>
, which utilizes a random forest model; and a predictor by Liu and Hu
<xref rid="pone.0040104-Liu1" ref-type="bibr">[32]</xref>
that uses a logistic regression model and information concerning B-factors and relative accessible surface area. Moreover, in recent years two new types of approaches were developed. The first, called Bepar
<xref rid="pone.0040104-Zhao1" ref-type="bibr">[33]</xref>
, is based on association patterns between antibody and antigen residues, and the other, EPMeta
<xref rid="pone.0040104-Liang1" ref-type="bibr">[30]</xref>
, is a consensus-based method, which combines multiple discontinuous epitope predictors. Finally, Epitopia
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
,
<xref rid="pone.0040104-Rubinstein2" ref-type="bibr">[35]</xref>
is a machine learning-based approach which utilizes Naïve Bayes to process information extracted based on physico-chemical and structural-geometrical properties from a surface patch defined using solvent accessibility. Since this method allows performing predictions from sequence alone, we include it in our comparative analysis.</p>
<table-wrap id="pone-0040104-t001" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.t001</object-id>
<label>Table 1</label>
<caption>
<title>Summary of the considered features and features selected and used in the proposed sequence-based predictor of B-cell epitopes.</title>
</caption>
<alternatives>
<graphic id="pone-0040104-t001-1" xlink:href="pone.0040104.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Feature group</td>
<td align="left" rowspan="1" colspan="1">Abbreviated name</td>
<td align="left" rowspan="1" colspan="1">Number of features</td>
<td align="left" rowspan="1" colspan="1">Number of selected features</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Predicted secondary structure (SS)</td>
<td align="left" rowspan="1" colspan="1">SS</td>
<td align="left" rowspan="1" colspan="1">8</td>
<td align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Predicted RSA</td>
<td align="left" rowspan="1" colspan="1">RA</td>
<td align="left" rowspan="1" colspan="1">33</td>
<td align="left" rowspan="1" colspan="1">5</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">RAAP score</td>
<td align="left" rowspan="1" colspan="1">RP</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="left" rowspan="1" colspan="1">24</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Conservation score</td>
<td align="left" rowspan="1" colspan="1">CS</td>
<td align="left" rowspan="1" colspan="1">29</td>
<td align="left" rowspan="1" colspan="1">2</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Predicted SS and RSA</td>
<td align="left" rowspan="1" colspan="1">SS+RA</td>
<td align="left" rowspan="1" colspan="1">12</td>
<td align="left" rowspan="1" colspan="1">6</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Predicted SS and conservation score</td>
<td align="left" rowspan="1" colspan="1">SS+CS</td>
<td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Predicted SS and RAAP score</td>
<td align="left" rowspan="1" colspan="1">SS+RP</td>
<td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">RAAP score and predicted RSA</td>
<td align="left" rowspan="1" colspan="1">RP+RA</td>
<td align="left" rowspan="1" colspan="1">30</td>
<td align="left" rowspan="1" colspan="1">17</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">RAAP and conservation scores</td>
<td align="left" rowspan="1" colspan="1">RP+CS</td>
<td align="left" rowspan="1" colspan="1">28</td>
<td align="left" rowspan="1" colspan="1">18</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Predicted SS and RSA, and RAAP score</td>
<td align="left" rowspan="1" colspan="1">SS+RA+RP</td>
<td align="left" rowspan="1" colspan="1">6</td>
<td align="left" rowspan="1" colspan="1">1</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Similarity score</td>
<td align="left" rowspan="1" colspan="1">SIM</td>
<td align="left" rowspan="1" colspan="1">10</td>
<td align="left" rowspan="1" colspan="1">7</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Total number of features</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">198</td>
<td align="left" rowspan="1" colspan="1">84</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone-0040104-t002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.t002</object-id>
<label>Table 2</label>
<caption>
<title>Comparison of predictive quality on the BCPREDFrag dataset calculated using 10-fold cross validation. The methods are sorted by their AUC values in the ascending order.</title>
</caption>
<alternatives>
<graphic id="pone-0040104-t002-2" xlink:href="pone.0040104.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Method</td>
<td align="left" rowspan="1" colspan="1">AUC</td>
<td align="left" rowspan="1" colspan="1">Accuracy</td>
<td align="left" rowspan="1" colspan="1">Sensitivity</td>
<td align="left" rowspan="1" colspan="1">Specificity</td>
<td align="left" rowspan="1" colspan="1">Precision</td>
<td align="left" rowspan="1" colspan="1">F-measure</td>
<td align="left" rowspan="1" colspan="1">MCC</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
<xref ref-type="table-fn" rid="nt101">a</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.700</td>
<td align="left" rowspan="1" colspan="1">0.641</td>
<td align="left" rowspan="1" colspan="1">0.529</td>
<td align="left" rowspan="1" colspan="1">0.752</td>
<td align="left" rowspan="1" colspan="1">0.681</td>
<td align="left" rowspan="1" colspan="1">0.596</td>
<td align="left" rowspan="1" colspan="1">0.29</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">BCPred
<xref ref-type="table-fn" rid="nt101">a</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.758</td>
<td align="left" rowspan="1" colspan="1">0.679</td>
<td align="left" rowspan="1" colspan="1">0.726</td>
<td align="left" rowspan="1" colspan="1">0.632</td>
<td align="left" rowspan="1" colspan="1">0.664</td>
<td align="left" rowspan="1" colspan="1">0.694</td>
<td align="left" rowspan="1" colspan="1">0.36</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">COBEpro
<xref ref-type="table-fn" rid="nt102">b</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.768</td>
<td align="left" rowspan="1" colspan="1">0.714</td>
<td align="left" rowspan="1" colspan="1">0.554</td>
<td align="left" rowspan="1" colspan="1">0.874</td>
<td align="left" rowspan="1" colspan="1">0.815</td>
<td align="left" rowspan="1" colspan="1">0.660</td>
<td align="left" rowspan="1" colspan="1">0.45</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SVM model 198
<xref ref-type="table-fn" rid="nt103">c</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.811</td>
<td align="left" rowspan="1" colspan="1">0.745</td>
<td align="left" rowspan="1" colspan="1">0.561</td>
<td align="left" rowspan="1" colspan="1">0.929</td>
<td align="left" rowspan="1" colspan="1">0.887</td>
<td align="left" rowspan="1" colspan="1">0.687</td>
<td align="left" rowspan="1" colspan="1">0.53</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SVM model 84
<xref ref-type="table-fn" rid="nt104">d</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.813</td>
<td align="left" rowspan="1" colspan="1">0.740</td>
<td align="left" rowspan="1" colspan="1">0.495</td>
<td align="left" rowspan="1" colspan="1">0.984</td>
<td align="left" rowspan="1" colspan="1">0.969</td>
<td align="left" rowspan="1" colspan="1">0.655</td>
<td align="left" rowspan="1" colspan="1">0.55</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt101">
<label>a</label>
<p>results from
<xref ref-type="table" rid="pone-0040104-t001">Table 1</xref>
in
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
.</p>
</fn>
<fn id="nt102">
<label>b</label>
<p>results from Table II in
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.</p>
</fn>
<fn id="nt103">
<label>c</label>
<p>results for the SVM model (
<italic>C</italic>
 = 8.0 and
<italic>gamma</italic>
 = 0.000977) that uses all 198 features.</p>
</fn>
<fn id="nt104">
<label>d</label>
<p>results for the SVM model (
<italic>C</italic>
 = 1.0 and
<italic>gamma</italic>
 = 0.001953) that uses the selected 84 features.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Our aim is to develop an accurate computational model for the prediction of both linear and conformational epitopes based on an approach similar to COBEpro
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
. We design a novel two-stage scheme that predicts conformational and linear epitopes from antigen chains based on accurate predictions of linear epitopes from the first stage. The motivation for our design comes from the fact that current methods use a wide variety of diverse inputs. We hypothesize that improvements can be attained by combining these inputs. The novelty of our BEST (B-cell Epitope prediction using Support vector machine Tool) method is two-fold. First, we effectively use multiple inputs including sequence conservation calculated using outputs from PSI-BLAST, predicted solvent accessibility and secondary structure (SS), and certain propensity and sequence similarity scores. Some of these inputs are motivated by existing works
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
,
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
,
<xref rid="pone.0040104-Wee1" ref-type="bibr">[23]</xref>
,
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
,
<xref rid="pone.0040104-Rubinstein2" ref-type="bibr">[35]</xref>
. However, we are the first to propose a sequence-based method that uses the residue conservation scores (conservation was previously used to build the structure-based EPSVR predictor
<xref rid="pone.0040104-Liang1" ref-type="bibr">[30]</xref>
) and to generate novel descriptors/features that combine multiple inputs, such as SS and conservation, SS and an antigenicity scale, solvent accessibility and conservation, etc. Second, we use a novel design of the second stage that utilizes a sliding window based on predictions of linear epitopes to compute propensities for formation of epitopes (both linear and conformational) for all residues in the input antigen sequence. This allows for more practical applications, in contrast to some other solutions, such as ABCPred
<xref rid="pone.0040104-Saha1" ref-type="bibr">[19]</xref>
, the method by Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
, BCPred
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
, and BayesB
<xref rid="pone.0040104-Wee1" ref-type="bibr">[23]</xref>
, which predict only short peptide fragments. Moreover, we empirically demonstrate that BEST outperforms recent sequence-based solutions including the method by Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
, BCPred
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
, ABCPred
<xref rid="pone.0040104-Saha1" ref-type="bibr">[19]</xref>
, CBTOPE
<xref rid="pone.0040104-Ansari1" ref-type="bibr">[25]</xref>
, and COBEpro
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.</p>
<table-wrap id="pone-0040104-t003" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.t003</object-id>
<label>Table 3</label>
<caption>
<title>Comparison of predictive quality on the ChenFrag dataset calculated using either 10-fold cross validation or 5-fold cross validation to match the test type from the corresponding manuscripts. The methods are sorted by their AUC values in the ascending order.</title>
</caption>
<alternatives>
<graphic id="pone-0040104-t003-3" xlink:href="pone.0040104.t003"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Method</td>
<td align="left" rowspan="1" colspan="1">AUC</td>
<td align="left" rowspan="1" colspan="1">Accuracy</td>
<td align="left" rowspan="1" colspan="1">Sensitivity</td>
<td align="left" rowspan="1" colspan="1">Specificity</td>
<td align="left" rowspan="1" colspan="1">Precision</td>
<td align="left" rowspan="1" colspan="1">F-measure</td>
<td align="left" rowspan="1" colspan="1">MCC</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
<xref ref-type="table-fn" rid="nt106">a</xref>
</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
<td align="left" rowspan="1" colspan="1">0.725</td>
<td align="left" rowspan="1" colspan="1">0.636</td>
<td align="left" rowspan="1" colspan="1">0.765</td>
<td align="left" rowspan="1" colspan="1">0.701</td>
<td align="left" rowspan="1" colspan="1">0.667</td>
<td align="left" rowspan="1" colspan="1">0.40</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SVM model 198
<xref ref-type="table-fn" rid="nt107">b</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.835</td>
<td align="left" rowspan="1" colspan="1">0.783</td>
<td align="left" rowspan="1" colspan="1">0.587</td>
<td align="left" rowspan="1" colspan="1">0.979</td>
<td align="left" rowspan="1" colspan="1">0.966</td>
<td align="left" rowspan="1" colspan="1">0.730</td>
<td align="left" rowspan="1" colspan="1">0.62</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">COBEpro
<xref ref-type="table-fn" rid="nt108">c</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.829</td>
<td align="left" rowspan="1" colspan="1">0.780</td>
<td align="left" rowspan="1" colspan="1">0.609</td>
<td align="left" rowspan="1" colspan="1">0.951</td>
<td align="left" rowspan="1" colspan="1">0.925</td>
<td align="left" rowspan="1" colspan="1">0.734</td>
<td align="left" rowspan="1" colspan="1">0.59</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SVM model 198
<xref ref-type="table-fn" rid="nt109">d</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.840</td>
<td align="left" rowspan="1" colspan="1">0.792</td>
<td align="left" rowspan="1" colspan="1">0.597</td>
<td align="left" rowspan="1" colspan="1">0.987</td>
<td align="left" rowspan="1" colspan="1">0.979</td>
<td align="left" rowspan="1" colspan="1">0.742</td>
<td align="left" rowspan="1" colspan="1">0.63</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">SVM model 84
<xref ref-type="table-fn" rid="nt110">e</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.848</td>
<td align="left" rowspan="1" colspan="1">0.788</td>
<td align="left" rowspan="1" colspan="1">0.579</td>
<td align="left" rowspan="1" colspan="1">0.998</td>
<td align="left" rowspan="1" colspan="1">0.996</td>
<td align="left" rowspan="1" colspan="1">0.732</td>
<td align="left" rowspan="1" colspan="1">0.63</td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt106">
<label>a</label>
<p>results based on 5-fold cross validation from
<xref ref-type="table" rid="pone-0040104-t003">Table 3</xref>
in
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
.</p>
</fn>
<fn id="nt107">
<label>b</label>
<p>results based on 5-fold cross validation for the SVM model (
<italic>C</italic>
 = 8.0 and
<italic>gamma</italic>
 = 0.000977) that uses all 198 features.</p>
</fn>
<fn id="nt108">
<label>c</label>
<p>results based on 10-fold cross validation from Table I in
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.</p>
</fn>
<fn id="nt109">
<label>d</label>
<p>results based on 10-fold cross validation for the SVM model (
<italic>C</italic>
 = 8.0 and
<italic>gamma</italic>
 = 0.000977) that uses all 198 features.</p>
</fn>
<fn id="nt110">
<label>e</label>
<p>results based on 10-fold cross validation for the SVM model (
<italic>C</italic>
 = 1.0 and
<italic>gamma</italic>
 = 0.001953) that uses the selected 84 features.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec sec-type="methods" id="s2">
<title>Methods</title>
<sec id="s2a">
<title>Overview of the proposed B-cell epitope predictor</title>
<p>BEST utilizes a two-stage design, see
<xref ref-type="fig" rid="pone-0040104-g001">Figure 1</xref>
. In the first stage, we use a sliding window to represent the input antigen chain as a set of 20-mers. Each 20-mer is encoded as a numerical feature vector that quantifies information in the window, which includes features extracted from:</p>
<list list-type="bullet">
<list-item>
<p>The chain, including the AA propensity scale that was introduced in
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
and sequence similarity scores proposed in
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
against a database of known (training) epitopic and non-epitopic peptides.</p>
</list-item>
<list-item>
<p>The evolutionary profile generated by PSI-BLAST including conservation scores calculated from the Weighted Observation Percentage (WOP) matrix.</p>
</list-item>
<list-item>
<p>The secondary structure and solvent accessibility that are predicted from the input chain with SPINE
<xref rid="pone.0040104-Faraggi1" ref-type="bibr">[36]</xref>
,
<xref rid="pone.0040104-Dor1" ref-type="bibr">[37]</xref>
.</p>
</list-item>
</list>
<p>Motivated by the designs of recent predictors
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
–
<xref rid="pone.0040104-Wee1" ref-type="bibr">[23]</xref>
,
<xref rid="pone.0040104-Ansari1" ref-type="bibr">[25]</xref>
, we apply an SVM-based model to predict epitopes using these features. In the second stage, we combine predictions from the SVM using a novel, custom-designed scheme that outputs the propensity of each AA to form a B-cell epitope.</p>
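<p>As an illustration of the two-stage flow described above, the following minimal Python sketch enumerates the sliding 20-mers and then combines per-window scores into per-residue propensities. The function names, the parameter k, and in particular the rule of averaging the k highest scores among the windows that cover a residue are assumptions made for illustration only; they are not the scheme implemented in BEST, and the window scores are assumed to come from an already trained classifier.</p>

```python
from typing import List, Sequence

WINDOW = 20  # window length used in the first stage


def sliding_windows(chain: str, size: int = WINDOW) -> List[str]:
    """Return every contiguous fragment of `size` residues, in chain order."""
    if size > len(chain):
        return []
    return [chain[i:i + size] for i in range(len(chain) - size + 1)]


def residue_propensities(chain_length: int,
                         window_scores: Sequence[float],
                         size: int = WINDOW,
                         k: int = 3) -> List[float]:
    """Average the k highest scores among the windows covering each residue.

    Hypothetical selection rule (an assumption, not the rule used by BEST).
    window_scores[i] is the score of the window that starts at position i.
    """
    propensities = []
    for pos in range(chain_length):
        # Windows starting at pos - size + 1 .. pos contain residue `pos`.
        first = max(0, pos - size + 1)
        last = min(pos, len(window_scores) - 1)
        covering = sorted(window_scores[first:last + 1], reverse=True)
        top = covering[:k]
        propensities.append(sum(top) / len(top) if top else 0.0)
    return propensities
```

<p>For a chain of 25 residues this yields six 20-mers; residues near the termini are covered by fewer windows, so their propensity is averaged over fewer scores.</p>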
<fig id="pone-0040104-g002" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Receiver operating characteristic (ROC) curves for the SVM model with 84 features, RAAP and MaxSimilarity models.</title>
<p>The curves were computed based on the 10-fold cross validation on the BCPREDFrag dataset (panel A) and ChenFrag dataset (panel B).</p>
</caption>
<graphic xlink:href="pone.0040104.g002"></graphic>
</fig>
</sec>
<sec id="s2b">
<title>Datasets and test protocols</title>
<p>We use two datasets composed of 20-mers. The ChenFrag dataset, which was introduced in
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
, consists of 872 20-mers that are B-cell epitopes and 872 non-B-cell epitope 20-mers. The epitope 20-mers were generated by truncation-and-extension from the BciPep database
<xref rid="pone.0040104-Saha2" ref-type="bibr">[38]</xref>
and the non-epitope fragments were taken from SWISS-PROT. The BCPREDFrag dataset was introduced in
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
and includes 701 epitope 20-mers and 701 non-epitope 20-mers. Originally, this dataset included 947 unique epitopes extracted from the BciPep database. After truncation-and-extension to 20-mers this set was no longer non-redundant. Therefore, the 20-mers were processed using CD-HIT
<xref rid="pone.0040104-Li1" ref-type="bibr">[39]</xref>
to obtain a reduced set of 701 epitopes, which share at most 80% similarity. The non-epitopes were selected from SWISS-PROT. We use this dataset to design (select relevant features and parameterize the SVM) our predictive model using 10-fold cross validation. The final design (using the same parameters and features) is tested on the ChenFrag dataset using 10-fold cross validation. The use of the 10-fold cross validation is motivated by the fact that the same test protocol was used in prior works
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
,
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.</p>
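<p>The 10-fold cross validation protocol can be sketched as follows. This is a minimal Python illustration, assuming that folds are stratified by class and drawn with a fixed random seed; the text does not state how the folds were actually generated, and the function names are hypothetical.</p>

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], n_folds: int = 10,
                     seed: int = 0) -> List[List[int]]:
    """Assign sample indices to folds so that each fold keeps the class balance."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for rank, index in enumerate(indices):
            folds[rank % n_folds].append(index)
    return folds


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, holding out one fold at a time."""
    for held_out in range(len(folds)):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

<p>With 701 epitopic and 701 non-epitopic 20-mers, every fold produced this way contains equal numbers of both classes, and each fragment is used for testing exactly once.</p>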
<table-wrap id="pone-0040104-t004" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.t004</object-id>
<label>Table 4</label>
<caption>
<title>AUC values on the BCPREDFrag and ChenFrag datasets calculated using 10-fold cross validation obtained by using selected features from individual feature groups; abbreviated names of feature groups are given in
<xref ref-type="table" rid="pone-0040104-t001">Table 1</xref>
.</title>
</caption>
<alternatives>
<graphic id="pone-0040104-t004-4" xlink:href="pone.0040104.t004"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Dataset</td>
<td align="left" rowspan="1" colspan="1">SS</td>
<td align="left" rowspan="1" colspan="1">RA</td>
<td align="left" rowspan="1" colspan="1">RP</td>
<td align="left" rowspan="1" colspan="1">CS</td>
<td align="left" rowspan="1" colspan="1">SS+RA</td>
<td align="left" rowspan="1" colspan="1">SS+CS</td>
<td align="left" rowspan="1" colspan="1">SS+RP</td>
<td align="left" rowspan="1" colspan="1">RP+RA</td>
<td align="left" rowspan="1" colspan="1">RP+CS</td>
<td align="left" rowspan="1" colspan="1">SS+RA+RP</td>
<td align="left" rowspan="1" colspan="1">SIM</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">BCPREDFrag</td>
<td align="left" rowspan="1" colspan="1">0.557</td>
<td align="left" rowspan="1" colspan="1">0.542</td>
<td align="left" rowspan="1" colspan="1">0.716</td>
<td align="left" rowspan="1" colspan="1">0.501</td>
<td align="left" rowspan="1" colspan="1">0.602</td>
<td align="left" rowspan="1" colspan="1">0.568</td>
<td align="left" rowspan="1" colspan="1">0.532</td>
<td align="left" rowspan="1" colspan="1">0.695</td>
<td align="left" rowspan="1" colspan="1">0.710</td>
<td align="left" rowspan="1" colspan="1">0.556</td>
<td align="left" rowspan="1" colspan="1">0.760</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">ChenFrag</td>
<td align="left" rowspan="1" colspan="1">0.565</td>
<td align="left" rowspan="1" colspan="1">0.547</td>
<td align="left" rowspan="1" colspan="1">0.743</td>
<td align="left" rowspan="1" colspan="1">0.496</td>
<td align="left" rowspan="1" colspan="1">0.584</td>
<td align="left" rowspan="1" colspan="1">0.545</td>
<td align="left" rowspan="1" colspan="1">0.555</td>
<td align="left" rowspan="1" colspan="1">0.738</td>
<td align="left" rowspan="1" colspan="1">0.743</td>
<td align="left" rowspan="1" colspan="1">0.560</td>
<td align="left" rowspan="1" colspan="1">0.824</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
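<p>For readers who wish to reproduce AUC values such as those reported in the tables, the following minimal Python sketch computes the AUC as the probability that a randomly chosen epitopic fragment receives a higher score than a randomly chosen non-epitopic fragment, with ties counted as one half. The function name is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed; it is not the evaluation code used in this work.</p>

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve from pairwise comparisons of scores.

    labels: 1 for epitopic fragments, 0 for non-epitopic fragments.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

<p>Under this definition a value of 0.5 corresponds to random ranking, which is a useful reference point when reading the AUC of the conservation score (CS) group in Table 4.</p>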
<p>We use an independent test set that was utilized as a test dataset in
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
. This dataset, which we call SEQ194, includes 194 protein sequences. Since the SEQ194 dataset was also derived from the BciPep database, we reduce the identity between SEQ194 and the BCPREDFrag dataset (which is used as our training/design dataset) to 40%. To do that, we remove any 20-mer from our training dataset that shares more than 40% identity with any chain in SEQ194, and we call the resulting dataset Filtered40_BCPREDFrag. This dataset includes 633 20-mer fragments: 86 epitopic and 547 non-epitopic. When testing our method on SEQ194, we build our predictor using Filtered40_BCPREDFrag. This includes the use of the filtered version of the training dataset as the database of known epitopic and non-epitopic peptides against which we calculate the sequence similarity scores according to the method from
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.</p>
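<p>The filtering step can be sketched as follows. This minimal Python example assumes that identity is measured as the best ungapped match of a fragment placed anywhere inside a test chain; the text does not specify the alignment procedure that was actually used, so both this identity measure and the function names are hypothetical.</p>

```python
from typing import List, Sequence


def best_ungapped_identity(fragment: str, chain: str) -> float:
    """Highest fraction of identical residues over all ungapped placements
    of `fragment` fully inside `chain` (0.0 if no placement exists)."""
    if not fragment:
        return 0.0
    best = 0.0
    for start in range(len(chain) - len(fragment) + 1):
        # zip stops at the end of the fragment, so only the aligned
        # stretch of the chain is compared.
        matches = sum(a == b for a, b in zip(fragment, chain[start:]))
        best = max(best, matches / len(fragment))
    return best


def filter_training_fragments(fragments: Sequence[str],
                              test_chains: Sequence[str],
                              max_identity: float = 0.40) -> List[str]:
    """Keep fragments that do not exceed max_identity with any test chain."""
    return [
        fragment for fragment in fragments
        if not any(best_ungapped_identity(fragment, chain) > max_identity
                   for chain in test_chains)
    ]
```

<p>A fragment is discarded as soon as one test chain exceeds the threshold, which mirrors the requirement that no training 20-mer shares more than 40% identity with any chain in SEQ194.</p>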
<fig id="pone-0040104-g003" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g003</object-id>
<label>Figure 3</label>
<caption>
<title>The values of the similarity-based scores between the 20-mers from the BCPREDFrag dataset and the library of the epitope fragments, i.e., the
<italic>max_similarity_epitope
<sub>1</sub>
</italic>
feature.</title>
<p>The black line shows the similarity scores for the native epitope fragments and the gray line for the non-epitope fragments. The
<italic>x</italic>
-axis corresponds to the sorted list (in the ascending order based on the similarity scores) of the 701 epitopic and 701 non-epitopic 20-mers from the BCPREDFrag dataset, and the
<italic>y</italic>
-axis shows their corresponding similarity scores.</p>
</caption>
<graphic xlink:href="pone.0040104.g003"></graphic>
</fig>
<p>We also use a second sequence-based test dataset called SEQ19, which includes 19 proteins and which was introduced in
<xref rid="pone.0040104-Liang1" ref-type="bibr">[30]</xref>
. The dataset was extracted from the Conformational Epitope Database
<xref rid="pone.0040104-Huang1" ref-type="bibr">[40]</xref>
by considering entries with unbound antigen structures and no complex structures; multiple entries with the same antigen structure were combined (antigenic residues from multiple entries were mapped onto one structure). The pairwise sequence identity in this dataset was reduced to at most 35%.</p>
<p>The datasets are available at
<ext-link ext-link-type="uri" xlink:href="http://biomine.ece.ualberta.ca/BEST/">http://biomine.ece.ualberta.ca/BEST/</ext-link>
.</p>
<fig id="pone-0040104-g004" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g004</object-id>
<label>Figure 4</label>
<caption>
<title>The AUC and success rate values as a function of the number of selected scores
<italic>k</italic>
(
<italic>x</italic>
-axis) when using SVM model with 84 features and the
<italic>distance scheme</italic>
to predict B-cell epitopes on the SEQ194 dataset.</title>
<p>We use the Filtered40_BCPREDFrag to generate the SVM model.</p>
</caption>
<graphic xlink:href="pone.0040104.g004"></graphic>
</fig>
<table-wrap id="pone-0040104-t005" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.t005</object-id>
<label>Table 5</label>
<caption>
<title>The AUC and success rate for the prediction of the B-cell epitopes on the SEQ194 dataset when using predictions from the SVM model with 84 features and the four schemes:
<italic>maximum, average, median, and distance scheme</italic>
, the last with
<italic>k</italic>
 = 10 and
<italic>k</italic>
 = 16. We use the Filtered40_BCPREDFrag to generate the SVM model.</title>
</caption>
<alternatives>
<graphic id="pone-0040104-t005-5" xlink:href="pone.0040104.t005"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Method</td>
<td align="left" rowspan="1" colspan="1">Success rate</td>
<td align="left" rowspan="1" colspan="1">AUC</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Max scheme</td>
<td align="left" rowspan="1" colspan="1">47.4%</td>
<td align="left" rowspan="1" colspan="1">0.52</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average scheme</td>
<td align="left" rowspan="1" colspan="1">56.2%</td>
<td align="left" rowspan="1" colspan="1">0.56</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Median scheme</td>
<td align="left" rowspan="1" colspan="1">60.8%</td>
<td align="left" rowspan="1" colspan="1">0.55</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Distance scheme
<italic>k</italic>
 = 10</td>
<td align="left" rowspan="1" colspan="1">58.8%</td>
<td align="left" rowspan="1" colspan="1">0.57</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Distance scheme
<italic>k</italic>
 = 16</td>
<td align="left" rowspan="1" colspan="1">60.3%</td>
<td align="left" rowspan="1" colspan="1">0.57</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
</sec>
<sec id="s2c">
<title>Evaluation of predictive quality</title>
<p>The predicted propensity of a given AA in the input protein chain is a real number which is (often) binarized to denote two outcomes: whether or not the residue is a part of an epitope. The evaluation of the binary predictions uses several quality measures including accuracy (ACC), sensitivity, specificity, precision, F-measure, and Matthews correlation coefficient (MCC):</p>
<p>Accuracy  =  (TP+TN)/(TP+FP+TN+FN)</p>
<p>Sensitivity  =  TP/(TP+FN)</p>
<p>Specificity  =  TN/(TN+FP)</p>
<p>Precision  =  TP/(TP+FP)</p>
<p>F-measure  = 2*TP/(2*TP+FN+FP)</p>
<p>MCC  =  (TP*TN−FP*FN)/sqrt{(TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)}</p>
<p>where TP and TN are the number of correctly predicted epitope and non-epitope residues, respectively, FP is the number of non-epitope residues that were predicted to be in an epitope, and FN is the number of epitope residues that were predicted not to be in an epitope. Higher values of these measures indicate better quality of predictions.</p>
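<p>As an illustration, the measures above can be computed directly from the four confusion-matrix counts. The following Python sketch is not part of the BEST implementation; the function name binary_metrics and the convention of returning 0 when a denominator is empty are assumptions made for this example.</p>
<preformat>
```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return the six quality measures for residue-level binary predictions."""
    def ratio(num, den):
        # Assumption of this sketch: an empty denominator yields 0.0.
        return num / den if den != 0 else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f_measure": ratio(2 * tp, 2 * tp + fn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Toy counts (not taken from the paper): 45 TP, 5 FP, 45 TN, 5 FN.
example = binary_metrics(tp=45, fp=5, tn=45, fn=5)
print(example["mcc"])  # 0.8
```
</preformat>
<p>The subtraction in the MCC numerator is what allows the measure to become negative when the predictions are worse than random.</p>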
<fig id="pone-0040104-g005" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g005</object-id>
<label>Figure 5</label>
<caption>
<title>The average AUC values estimated using SEQ194 dataset.</title>
<p>The values were calculated over the 10 repetitions using 100 randomly selected chains from the SEQ194 dataset (shown using gray bars) and the corresponding standard deviations (shown using black error bars) for the considered B-cell epitope predictors.</p>
</caption>
<graphic xlink:href="pone.0040104.g005"></graphic>
</fig>
<table-wrap id="pone-0040104-t006" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.t006</object-id>
<label>Table 6</label>
<caption>
<title>Comparison of the proposed BEST method with existing B-cell epitope predictors on the SEQ194 dataset.</title>
</caption>
<alternatives>
<graphic id="pone-0040104-t006-6" xlink:href="pone.0040104.t006"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Category</td>
<td align="left" rowspan="1" colspan="1">Method</td>
<td align="left" rowspan="1" colspan="1">Success rate</td>
<td align="left" rowspan="1" colspan="1">AUC</td>
<td colspan="2" align="left" rowspan="1">Significance of improvement in AUC</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">compared to BEST
<sub>16</sub>
<xref ref-type="table-fn" rid="nt118">g</xref>
</td>
<td align="left" rowspan="1" colspan="1">compared to BEST
<sub>10</sub>
<xref ref-type="table-fn" rid="nt118">g</xref>
</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Structure-based</td>
<td align="left" rowspan="1" colspan="1">Epitopia
<xref ref-type="table-fn" rid="nt112">a</xref>
</td>
<td align="left" rowspan="1" colspan="1">80.4%</td>
<td align="left" rowspan="1" colspan="1">0.59</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">Epitopia
<xref ref-type="table-fn" rid="nt113">b</xref>
</td>
<td align="left" rowspan="1" colspan="1">73.7%</td>
<td align="left" rowspan="1" colspan="1">0.57</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Sequence-based</td>
<td align="left" rowspan="1" colspan="1">ABCPred
<xref ref-type="table-fn" rid="nt112">a</xref>
</td>
<td align="left" rowspan="1" colspan="1">67.0%</td>
<td align="left" rowspan="1" colspan="1">0.55</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">ABCPred
<xref ref-type="table-fn" rid="nt114">c</xref>
</td>
<td align="left" rowspan="1" colspan="1">61.9%</td>
<td align="left" rowspan="1" colspan="1">0.53</td>
<td align="left" rowspan="1" colspan="1">+</td>
<td align="left" rowspan="1" colspan="1">+</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">BayesB
<xref ref-type="table-fn" rid="nt115">d</xref>
</td>
<td align="left" rowspan="1" colspan="1">80.9%</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">CBTOPE
<xref ref-type="table-fn" rid="nt116">e</xref>
</td>
<td align="left" rowspan="1" colspan="1">45.9%</td>
<td align="left" rowspan="1" colspan="1">0.52</td>
<td align="left" rowspan="1" colspan="1">+</td>
<td align="left" rowspan="1" colspan="1">+</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">COBEpro
<xref ref-type="table-fn" rid="nt112">a</xref>
</td>
<td align="left" rowspan="1" colspan="1">66.9%</td>
<td align="left" rowspan="1" colspan="1">0.55</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
<td align="left" rowspan="1" colspan="1">unavailable</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">COBEpro
<xref ref-type="table-fn" rid="nt117">f</xref>
</td>
<td align="left" rowspan="1" colspan="1">66.3%</td>
<td align="left" rowspan="1" colspan="1">0.54</td>
<td align="left" rowspan="1" colspan="1">+</td>
<td align="left" rowspan="1" colspan="1">+</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">BEST
<sub>10</sub>
<xref ref-type="table-fn" rid="nt118">g</xref>
</td>
<td align="left" rowspan="1" colspan="1">58.8%</td>
<td align="left" rowspan="1" colspan="1">0.57</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">BEST
<sub>16</sub>
<xref ref-type="table-fn" rid="nt118">g</xref>
</td>
<td align="left" rowspan="1" colspan="1">60.3%</td>
<td align="left" rowspan="1" colspan="1">0.57</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
</alternatives>
<table-wrap-foot>
<fn id="nt111">
<label></label>
<p>The methods are sorted alphabetically within each category. We evaluate significance of differences between BEST
<sub>16</sub>
(BEST
<sub>10</sub>
) and the other methods. We compare the corresponding AUC values in 10 paired results based on 100 random selected chains from the SEQ194 dataset using paired t-test; +/– mean that BEST
<sub>16</sub>
(BEST
<sub>10</sub>
) is significantly better/worse than the other method at
<italic>p</italic>
-value <0.05.</p>
</fn>
<fn id="nt112">
<label>a</label>
<p>results from
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
.</p>
</fn>
<fn id="nt113">
<label>b</label>
<p>results from the Epitopia web server at
<ext-link ext-link-type="uri" xlink:href="http://epitopia.tau.ac.il/">http://epitopia.tau.ac.il/</ext-link>
.</p>
</fn>
<fn id="nt114">
<label>c</label>
<p>results from the ABCPred web server
<ext-link ext-link-type="uri" xlink:href="http://www.imtech.res.in/raghava/abcpred/">http://www.imtech.res.in/raghava/abcpred/</ext-link>
.</p>
</fn>
<fn id="nt115">
<label>d</label>
<p>results from the BayesB web server at
<ext-link ext-link-type="uri" xlink:href="http://www.immunopred.org/bayesb/index.html">http://www.immunopred.org/bayesb/index.html</ext-link>
.</p>
</fn>
<fn id="nt116">
<label>e</label>
<p>results from the CBTOPE web server at
<ext-link ext-link-type="uri" xlink:href="http://www.imtech.res.in/raghava/cbtope/">http://www.imtech.res.in/raghava/cbtope/</ext-link>
.</p>
</fn>
<fn id="nt117">
<label>f</label>
<p>results from the COBEpro web server at
<ext-link ext-link-type="uri" xlink:href="http://scratch.proteomics.ics.uci.edu/">http://scratch.proteomics.ics.uci.edu/</ext-link>
.</p>
</fn>
<fn id="nt118">
<label>g</label>
<p>results generated using BEST method, which is based on the SVM model (
<italic>C</italic>
 = 1.0 and
<italic>gamma</italic>
 = 0.001953) with 84 features generated with the Filtered40_BCPREDFrag dataset and the distance scheme with
<italic>k</italic>
 = 16 (BEST
<sub>16</sub>
) and with
<italic>k</italic>
 = 10 (BEST
<sub>10</sub>
).</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We calculate the area under the ROC curve (AUC) to evaluate the real-valued predictions. We also use the success rate that was proposed earlier
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
,
<xref rid="pone.0040104-Rubinstein2" ref-type="bibr">[35]</xref>
. The success rate is defined as the number of correctly predicted proteins divided by the total number of predicted proteins. A given chain is assumed to be correctly predicted if the average of the real-valued predicted propensities for the native epitope residues is larger than the average of the real-valued predicted propensities of all residues in that chain.</p>
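<p>The success rate can be sketched in a few lines of Python. This is an illustrative example rather than the code used in this work; the function names and the treatment of chains without annotated epitope residues (counted as failures) are assumptions.</p>
<preformat>
```python
def chain_is_success(propensities, epitope_mask):
    """True if the mean propensity of native epitope residues exceeds the chain mean."""
    epitope_scores = [p for p, is_epi in zip(propensities, epitope_mask) if is_epi]
    if not epitope_scores:
        return False  # assumption: no annotated epitope residues counts as a failure
    chain_mean = sum(propensities) / len(propensities)
    epitope_mean = sum(epitope_scores) / len(epitope_scores)
    return epitope_mean > chain_mean

def success_rate(chains):
    """chains: list of (propensities, epitope_mask) pairs, one pair per protein."""
    hits = sum(chain_is_success(p, m) for p, m in chains)
    return hits / len(chains)

# Two toy chains (made-up numbers): the first is a success, the second is not.
toy = [
    ([0.9, 0.8, 0.2, 0.1], [True, True, False, False]),
    ([0.1, 0.2, 0.8, 0.9], [True, True, False, False]),
]
print(success_rate(toy))  # 0.5
```
</preformat>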
<fig id="pone-0040104-g006" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g006</object-id>
<label>Figure 6</label>
<caption>
<title>Receiver operating characteristic (ROC) curves of the considered B-cell epitope predictors on the SEQ194 dataset.</title>
</caption>
<graphic xlink:href="pone.0040104.g006"></graphic>
</fig>
</sec>
<sec id="s2d">
<title>Feature-based representation of the input sequence</title>
<p>We considered five types of input information to calculate our features: predicted secondary structure, predicted solvent accessibility, a dipeptide-based antigenicity scale, and the conservation and similarity scores.</p>
<fig id="pone-0040104-g007" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g007</object-id>
<label>Figure 7</label>
<caption>
<title>Receiver operating characteristic (ROC) curves of the considered B-cell epitope predictors on the SEQ19 dataset.</title>
</caption>
<graphic xlink:href="pone.0040104.g007"></graphic>
</fig>
<p>Secondary structure and solvent accessibility were predicted with the standalone version 3.0 of Real-SPINE
<xref rid="pone.0040104-Faraggi1" ref-type="bibr">[36]</xref>
. We use relative solvent accessibility (RSA), which is defined as the ratio of solvent accessible surface area (ASA) of a residue observed in its three dimensional structure to that observed in an extended tripeptide conformation. We normalize the ASA values generated by Real-SPINE using Ala-X-Ala tripeptide as suggested in
<xref rid="pone.0040104-Ahmad1" ref-type="bibr">[41]</xref>
,
<xref rid="pone.0040104-Ahmad2" ref-type="bibr">[42]</xref>
. The RSA values were used to categorize residues as buried (if predicted RSA<25%) or solvent exposed (otherwise).</p>
<fig id="pone-0040104-g008" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0040104.g008</object-id>
<label>Figure 8</label>
<caption>
<title>Residue epitopic propensities predicted by ABCPred, COBEpro, Epitopia and BEST for a capsid protein (UniProt ID: P16489; panel A) and an anti-repression transactivator protein (UniProt ID: P20869; panel B).</title>
<p>The plots also include the location of the native epitopes. The
<italic>x</italic>
-axis shows the protein chain and the location of the native epitopes (denoted with a black horizontal line) and the
<italic>y</italic>
-axis shows the values of the predicted propensities. The left
<italic>y</italic>
-axis gives the propensities for ABCPred, COBEpro and Epitopia and the right
<italic>y</italic>
-axis for BEST.</p>
</caption>
<graphic xlink:href="pone.0040104.g008"></graphic>
</fig>
<p>The amino acid pair propensity scale (AAP) was first introduced by Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
. This scale quantifies the propensity of a given dipeptide (AA pair) to form a B-cell epitope and was shown to provide useful information for predicting B-cell epitopes
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
. The original AAP values were renormalized to the (−1, 1) interval
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
and we denote them as the RAAP scale.</p>
<p>We run PSI-BLAST
<xref rid="pone.0040104-Altschul1" ref-type="bibr">[24]</xref>
against the nr dataset using default parameters (-j 3, -d nr) to compute the conservation which is defined as
<xref rid="pone.0040104-Wang1" ref-type="bibr">[43]</xref>
:</p>
<p>Conservation  =  SUM<sub><italic>i</italic> = 1..20</sub> {<italic>P<sub>i</sub></italic>*log<sub>2</sub>(<italic>P<sub>i</sub></italic>/<italic>P<sub>ib</sub></italic>)}</p>
<p>where
<italic>P
<sub>i</sub>
</italic>
is the value from the Weighted Observation Percentage (WOP) matrix generated by PSI-BLAST, which is divided by 100, and
<italic>P
<sub>ib</sub>
</italic>
is the background probability of each of the 20 AAs. If for a given residue all WOP values equal zero, i.e.,
<italic>P<sub>i</sub></italic>
is a vector of 20 zeroes, then we use the average WOP values that are computed as the average over all residues of the same type in the training dataset for which the WOP values are non-zero. The selection of this conservation measure is motivated by results in
<xref rid="pone.0040104-Wang1" ref-type="bibr">[43]</xref>
.</p>
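<p>A minimal Python sketch of the conservation score for a single residue is given below. It assumes that the WOP row and the background probabilities are supplied as two lists of 20 numbers in the same amino acid order; the function name is hypothetical, and the fallback for all-zero WOP rows described above is not implemented here.</p>
<preformat>
```python
import math

def conservation(wop_row, background):
    """Relative-entropy conservation score for one residue.

    wop_row: 20 weighted observation percentages (0..100) for the residue.
    background: 20 background probabilities in the same amino acid order.
    """
    total = 0.0
    for wop, p_b in zip(wop_row, background):
        p = wop / 100.0  # convert the percentage to a probability
        if p > 0:  # terms with p = 0 contribute nothing to the sum
            total += p * math.log2(p / p_b)
    return total

# Toy input with a uniform background (an assumption made for illustration only).
uniform = [0.05] * 20
fully_conserved = conservation([100] + [0] * 19, uniform)  # equals log2(20)
not_conserved = conservation([5] * 20, uniform)            # equals 0
```
</preformat>
<p>Under this definition a position that matches the background distribution scores 0, and the score grows as the observed distribution departs from the background.</p>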
<p>Following
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
, we compute similarity scores that quantify similarity of a given input 20-mer and the epitope and non-epitope fragments in the corresponding training dataset; we adjust the training datasets for each fold in the cross-validation tests and we use Filtered40_BCPREDFrag dataset when testing on the SEQ194 dataset. The scores are based on the total number of identical substrings (multi-mers) between the two 20-mers, i.e., they count the number of the same AAs, the same 2-mers, 3-mers, etc. present in both fragments. Such scores were found to be the most effective among several possible similarity measures in
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
. We use the five highest scores when calculating similarity to epitope fragments and non-epitope fragments, respectively.</p>
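<p>The following Python sketch illustrates one way to count shared substrings and to keep the five highest scores. It is a simplified reading of the description above, not the implementation from the cited work: it assumes that matches are position-independent and counted with multiplicity, and all function names are hypothetical.</p>
<preformat>
```python
from collections import Counter

def kmer_counts(seq, k):
    """Multiset of all substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def substring_similarity(a, b):
    """Number of substrings (1-mers, 2-mers, ...) shared by two fragments."""
    score = 0
    for k in range(1, min(len(a), len(b)) + 1):
        counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
        # Assumption: matches are position-independent and counted with multiplicity.
        score += sum(min(c, counts_b[s]) for s, c in counts_a.items())
    return score

def top_scores(query, library, k=5):
    """The k highest similarity scores between a query and a fragment library."""
    scores = (substring_similarity(query, frag) for frag in library)
    return sorted(scores, reverse=True)[:k]

# Toy fragments (short strings used only to keep the example readable).
print(substring_similarity("ABC", "ABD"))  # 3: shared A, B and AB
```
</preformat>
<p>Calling top_scores once with the epitope library and once with the non-epitope library would give the ten similarity-based values for one input 20-mer.</p>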
<p>Using the above information, we considered the following 11 groups of features:</p>
<list list-type="order">
<list-item>
<p>
<italic>Secondary structure-based (8 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>content
<sub>ss</sub>
</italic>
is the content (fraction) of the residues in the input 20-mer that have a given predicted secondary structure
<italic>ss</italic>
 =  {helix (H), strand (E), coil (C)} (3 features).</p>
</list-item>
<list-item>
<p>
<italic>entropy_SS</italic>
 =  SUM
<italic>
<sub>ss = {helix,strand,coil}</sub>
</italic>
{
<italic>content
<sub>ss</sub>
</italic>
ln(
<italic>content
<sub>ss</sub>
</italic>
)}, which is the overall entropy of the predicted secondary structure in the input 20-mer (1 feature).</p>
</list-item>
<list-item>
<p>
<italic>NumSeg
<sub>ss</sub>
</italic>
is the number of segments of a given predicted secondary structure type
<italic>ss</italic>
in the input 20-mer. A segment is defined as a stretch of consecutive AAs with the same secondary structure. For example, for the predicted secondary structure “HHHCEEEEEEEECCCHHHCCCECC”,
<italic>NumSeg
<sub>H</sub>
</italic>
 = 2,
<italic>NumSeg
<sub>C</sub>
</italic>
 = 4,
<italic>NumSeg
<sub>E</sub>
</italic>
 = 2. (3 features).</p>
</list-item>
<list-item>
<p>
<italic>NumSeg_SS</italic>
is the total number of predicted secondary structure segments in the input 20-mer (1 feature).</p>
</list-item>
</list>
</list-item>
</list>
<p>We note that similar, segment-based features were successfully used in
<xref rid="pone.0040104-Mizianty1" ref-type="bibr">[44]</xref>
.</p>
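<p>The eight secondary structure-based features can be sketched as follows. The code is an illustration under stated assumptions: the function name is hypothetical, the input is a string over the letters H, E and C, the entropy is computed exactly as written above (without a leading minus sign), and empty classes contribute 0 to the sum.</p>
<preformat>
```python
import math
from itertools import groupby

def ss_features(ss):
    """Content, entropy and segment counts for a predicted H/E/C string."""
    n = len(ss)
    content = {s: ss.count(s) / n for s in "HEC"}
    # Entropy as written in the text; classes with zero content are skipped.
    entropy = sum(c * math.log(c) for c in content.values() if c > 0)
    # groupby collapses each run of identical states into one segment.
    segments = [state for state, _ in groupby(ss)]
    num_seg = {s: segments.count(s) for s in "HEC"}
    return content, entropy, num_seg, len(segments)

# The example string from the text.
content, entropy, num_seg, total = ss_features("HHHCEEEEEEEECCCHHHCCCECC")
print(num_seg)  # {'H': 2, 'E': 2, 'C': 4}
print(total)    # 8
```
</preformat>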
<list list-type="order">
<list-item>
<p>
<italic>RSA-based (33 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>content
<sub>Bd/Ed</sub>
</italic>
is the content (fraction) of the residues in the input 20-mer that are predicted to be buried (Bd) or solvent exposed (Ed) (2 features).</p>
</list-item>
<list-item>
<p>
<italic>entropy_RSA</italic>
 =  SUM
<italic>
<sub>i = {buried,exposed}</sub>
</italic>
{
<italic>content
<sub>i</sub>
</italic>
ln(
<italic>content
<sub>i</sub>
</italic>
)}, which is the overall entropy of the predicted solvent exposure (content of buried vs. solvent exposed residues) in the input 20-mer (1 feature).</p>
</list-item>
<list-item>
<p>
<italic>RSA
<sub>Bd/Ed</sub>
</italic>
is the average predicted RSA value for buried (Bd) or solvent exposed (Ed) residues in the input 20-mer (2 features).</p>
</list-item>
<list-item>
<p>
<italic>max/min_RSA_slide
<sub>n</sub>
</italic>
is the maximum/minimum value of predicted RSA averaged over a sliding window of size
<italic>n</italic>
 = 5,6, …,17,18 within the input 20-mer. We consider 14 sizes of sliding window and calculate both min and max values (14×2 = 28 features). This allows us to find smaller fragments of input 20-mer that are either solvent exposed or buried.</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>RAAP-based (30 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>avg_RAAP</italic>
is the average RAAP value of the input 20-mer (1 feature).</p>
</list-item>
<list-item>
<p>
<italic>sd_RAAP</italic>
is the standard deviation of RAAP values of the input 20-mer (1 feature).</p>
</list-item>
<list-item>
<p>
<italic>max/min_RAAP_slide
<sub>n</sub>
</italic>
is the maximum/minimum value of RAAP averaged over a sliding window of size
<italic>n</italic>
 = 5,6, …,17,18 within the input 20-mer (14×2 = 28 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>Conservation score-based (29 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>avg_CON</italic>
is the average conservation score of the input 20-mer (1 feature).</p>
</list-item>
<list-item>
<p>
<italic>max/min_CON_slide
<sub>n</sub>
</italic>
is the maximum/minimum value of conservation score averaged over a sliding window of size
<italic>n</italic>
 = 5,6, …,17,18 within the input 20-mer (14×2 = 28 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>Secondary structure and RSA-based (12 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>Num
<sub>ss_Bd/Ed</sub>
</italic>
is the number of residues in the input 20-mer that have a given predicted secondary structure
<italic>ss</italic>
and which are predicted to be buried (Bd) or solvent exposed (Ed) (3×2 = 6 features).</p>
</list-item>
<list-item>
<p>
<italic>RSA
<sub>ss</sub>
</italic>
is the average predicted RSA value for the residues in the input 20-mer that are predicted to have secondary structure
<italic>ss</italic>
(3 features).</p>
</list-item>
<list-item>
<p>
<italic>RSA_max_segment
<sub>ss</sub>
</italic>
is the average predicted RSA value for the longest segment of a given predicted secondary structure type
<italic>ss</italic>
in the input 20-mer (3 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>Secondary structure and conservation score-based (6 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>CON
<sub>ss</sub>
</italic>
is the average conservation value for residues in the input 20-mer that have a given predicted secondary structure
<italic>ss</italic>
(3 features).</p>
</list-item>
<list-item>
<p>
<italic>CON_max_segment
<sub>ss</sub>
</italic>
is the average conservation value for the longest segment of a given predicted secondary structure type
<italic>ss</italic>
in the input 20-mer (3 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>Secondary structure and RAAP-based (6 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>RAAP
<sub>ss</sub>
</italic>
is the average RAAP value for residues in the input 20-mer that have a given predicted secondary structure
<italic>ss</italic>
(3 features).</p>
</list-item>
<list-item>
<p>
<italic>RAAP_max_segment
<sub>ss</sub>
</italic>
is the average RAAP value for the longest segment of a given predicted secondary structure type
<italic>ss</italic>
in the input 20-mer (3 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>RAAP and RSA-based (30 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>RAAP
<sub>Bd/Ed</sub>
</italic>
is the average RAAP value of the residues predicted to be buried (Bd) or solvent exposed (Ed) in the input 20-mer (2 features).</p>
</list-item>
<list-item>
<p>
<italic>avg_RAAP_max/min_RSA_slide<sub>n</sub></italic>
is the average RAAP value in a sliding window of size
<italic>n</italic>
 = 5,6, …,17,18 within the input 20-mer that has the maximum/minimum average predicted RSA value (14×2 = 28 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>RAAP and conservation score-based (28 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>avg_RAAP_max/min_CON_slide<sub>n</sub></italic>
is the average RAAP value in a sliding window of size
<italic>n</italic>
 = 5,6, …,17,18 within the input 20-mer that has the maximum/minimum average conservation score value (14×2 = 28 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>Secondary structure, RAAP and RSA-based (6 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>RAAP
<sub>ss_Bd/Ed</sub>
</italic>
is the average RAAP value for residues in the input 20-mer that have a given predicted secondary structure
<italic>ss</italic>
and which are predicted to be buried (Bd) or solvent exposed (Ed) (6 features).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>
<italic>Similarity score-based (10 features).</italic>
</p>
<list list-type="bullet">
<list-item>
<p>
<italic>max_similarity_epitope
<sub>k</sub>
</italic>
is the
<italic>k</italic>
<sup>th</sup>
highest similarity score between the input 20-mer and the epitope fragments from the training dataset;
<italic>k</italic>
 = 1,2,3,4,5 (5 features).</p>
</list-item>
<list-item>
<p>
<italic>max_similarity_non-epitope
<sub>k</sub>
</italic>
is the
<italic>k</italic>
<sup>th</sup>
highest similarity score between the input 20-mer and the non-epitope fragments from the training dataset;
<italic>k</italic>
 = 1,2,3,4,5 (5 features).</p>
</list-item>
</list>
</list-item>
</list>
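<p>The sliding window features that recur in several of the groups above can be illustrated with the following Python sketch. It assumes that the per-position values (RSA, RAAP or conservation) are given as a plain list, that ties between windows are resolved by taking the first window, and that the function and feature names are hypothetical.</p>
<preformat>
```python
def window_means(values, n):
    """Means of all contiguous windows of size n."""
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]

def slide_features(values, sizes=range(5, 19)):
    """Max and min window means for n = 5..18 (14 sizes x 2 = 28 features)."""
    feats = {}
    for n in sizes:
        means = window_means(values, n)
        feats["max_slide_%d" % n] = max(means)
        feats["min_slide_%d" % n] = min(means)
    return feats

def coupled_window_mean(primary, secondary, n, pick=max):
    """Mean of `secondary` in the size-n window where `primary` has its max (or min) mean."""
    means = window_means(primary, n)
    start = means.index(pick(means))  # assumption: the first window wins on ties
    return sum(secondary[start:start + n]) / n

# Toy 20-residue profile with made-up values 0..19.
profile = list(range(20))
feats = slide_features(profile)
print(len(feats))             # 28
print(feats["max_slide_5"])   # 17.0
```
</preformat>
<p>The function coupled_window_mean corresponds to features of the avg_RAAP_max/min_RSA_slide kind, where one type of input selects the window and another type is averaged inside it.</p>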
<p>
<xref ref-type="table" rid="pone-0040104-t001">Table 1</xref>
summarizes the considered 198 features, which are divided into the above-mentioned 11 groups. While some of these features use the information that was previously considered to predict B-cell epitopes, including predicted secondary structure and RSA, RAAP and similarity scores, we also use conservation scores that were not used by the prior sequence-based predictors. Moreover, we propose a novel set of features that combine multiple types of information (such as predicted secondary structure and RSA; predicted secondary structure and conservation, etc.) and we use sliding windows to find fragments of the input 20-mer (such as fragments with low/high RAAP score, RSA value, etc.) that are relevant to the prediction of the B-cell epitopes.</p>
</sec>
<sec id="s2e">
<title>Feature selection and parameterization of the SVM model</title>
<p>The considered features may include some that are not relevant to the prediction of B-cell epitopes or that are correlated/redundant with each other. To account for this, we perform wrapper-based feature selection using the SVM model. We use the SVM model with the RBF kernel and parameterize it using a grid search over the complexity constant
<italic>C</italic>
and the
<italic>gamma</italic>
(spread of the RBF function) using all 198 features. Parameterization was done based on the 10-fold cross validation on the training BCPREDFrag dataset and we considered
<italic>C</italic>
 = 2
<sup>−2</sup>
,2
<sup>−1</sup>
…, 2
<sup>3</sup>
,2
<sup>4</sup>
and
<italic>gamma</italic>
 = 2
<sup>−11</sup>
,2
<sup>−10</sup>
…,2
<sup>−1</sup>
,2
<sup>0</sup>
. The selected parameters are
<italic>C</italic>
 = 2
<sup>3</sup>
and
<italic>gamma</italic>
 = 2
<sup>−10</sup>
, and we use these parameters through the entire feature selection process.</p>
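<p>The grid described above contains 7 values of C and 12 values of gamma. The Python sketch below enumerates it; the scoring callable stands in for the 10-fold cross-validated quality of an SVM and is purely hypothetical, and the toy scorer is constructed only so that its optimum coincides with the values reported above.</p>
<preformat>
```python
import math
from itertools import product

C_GRID = [2.0 ** e for e in range(-2, 5)]       # 2^-2, 2^-1, ..., 2^4
GAMMA_GRID = [2.0 ** e for e in range(-11, 1)]  # 2^-11, 2^-10, ..., 2^0

def grid_search(cv_score):
    """Return the (C, gamma) pair that maximizes cv_score(C, gamma)."""
    return max(product(C_GRID, GAMMA_GRID), key=lambda cg: cv_score(*cg))

def toy_score(C, gamma):
    # Hypothetical stand-in for a cross-validated score; not a real SVM evaluation.
    return -abs(math.log2(C) - 3) - abs(math.log2(gamma) + 10)

best_C, best_gamma = grid_search(toy_score)
print(best_C, best_gamma)  # 8.0 0.0009765625
```
</preformat>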
<p>We first sort all features based on their average absolute biserial correlation coefficients (BCC), where the average is taken over the ten training folds of the 10-fold cross-validation on the training dataset. The BCC is defined as:</p>
<p>BCC = (<italic>M<sub>e</sub></italic> − <italic>M<sub>ne</sub></italic>)*sqrt(<italic>n<sub>e</sub></italic>*<italic>n<sub>ne</sub></italic>/<italic>n</italic><sup>2</sup>)/<italic>stdev</italic></p>
<p>where
<italic>M
<sub>e</sub>
</italic>
and
<italic>M
<sub>ne</sub>
</italic>
are the mean values of the feature values for native epitopic and non-epitopic residues, respectively,
<italic>stdev</italic>
is the standard deviation of the feature,
<italic>n
<sub>e</sub>
</italic>
and
<italic>n
<sub>ne</sub>
</italic>
are the numbers of native epitopic and non-epitopic residues, respectively, and
<italic>n</italic>
is the total number of residues.</p>
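<p>A small Python sketch of this correlation is given below. It is an illustration, not the code used in this work: the function name is hypothetical, the standard deviation is the population standard deviation over all samples, and the sketch places the squared total count under the square root (the standard point-biserial normalization), which keeps the value between −1 and 1.</p>
<preformat>
```python
import math

def biserial_cc(values, labels):
    """Correlation between one feature and binary epitope labels."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    n, n_e, n_ne = len(values), len(pos), len(neg)
    mean_all = sum(values) / n
    # Assumption: population standard deviation over all samples.
    stdev = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if stdev == 0 or n_e == 0 or n_ne == 0:
        return 0.0  # assumption: undefined cases are reported as 0
    m_e, m_ne = sum(pos) / n_e, sum(neg) / n_ne
    return (m_e - m_ne) * math.sqrt(n_e * n_ne / n ** 2) / stdev

# Toy feature that separates the two classes perfectly.
print(biserial_cc([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```
</preformat>
<p>Features would then be ranked by the absolute value of this coefficient, averaged over the training folds.</p>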
<p>Next, we iteratively try to remove one feature at a time, starting with the entire set of 198 sorted features and considering the least correlated features first. We calculate MCC for the 10-fold cross validation-based prediction of B-cell epitopes on the training (BCPREDFrag) dataset using the SVM classifier with a given set of features. We remove a given feature if this removal does not lower the MCC value. We repeat this until none of the features can be removed, i.e., removal of any feature leads to a decrease in the MCC. This type of feature selection was motivated by similar approaches used in related studies
<xref rid="pone.0040104-Chen3" ref-type="bibr">[45]</xref>
–
<xref rid="pone.0040104-Mizianty3" ref-type="bibr">[47]</xref>
.</p>
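<p>The elimination procedure can be sketched in Python as follows. The evaluate callable represents the cross-validated MCC of the SVM for a given feature subset and is a hypothetical placeholder; the toy evaluator and the feature names are made up for illustration only.</p>
<preformat>
```python
def backward_elimination(ranked_features, evaluate):
    """ranked_features: least correlated first; evaluate(features) returns a score."""
    selected = list(ranked_features)
    best = evaluate(selected)
    changed = True
    while changed:  # repeat passes until no feature can be removed
        changed = False
        for feat in list(selected):
            if len(selected) == 1:
                break  # assumption: always keep at least one feature
            trial = [f for f in selected if f != feat]
            score = evaluate(trial)
            if score >= best:  # the removal does not lower the score
                selected, best = trial, score
                changed = True
    return selected, best

# Toy evaluator: only features "a" and "c" carry signal (hypothetical).
useful = {"a", "c"}

def toy_evaluate(features):
    return sum(1 for f in features if f in useful)

print(backward_elimination(["d", "b", "c", "a"], toy_evaluate))  # (['c', 'a'], 2)
```
</preformat>
<p>Each call to evaluate would correspond to one full 10-fold cross-validation run, so the procedure is costly when the starting set is large.</p>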
<p>Consequently, 84 features were retained, see
<xref ref-type="table" rid="pone-0040104-t001">Table 1</xref>
. A detailed list of the selected features is given in
<xref ref-type="supplementary-material" rid="pone.0040104.s001">Table S1</xref>
. Importantly, the selected features cover each of the considered 11 feature groups, which suggests that all considered groups contribute to the prediction of B-cell epitopes. The largest subset of the selected features concerns the RAAP scale: 60 of the 84 selected features use the RAAP values. Arguably the best feature, which has the highest absolute BCC of 0.47 (compared to 0.4 for the second-best feature), is the
<italic>max_similarity_epitope
<sub>1</sub>
</italic>
. This feature corresponds to the highest similarity score against the database of training B-cell epitopes. This agrees with the results in
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
, where the authors demonstrate that the use of these similarity scores leads to relatively accurate predictions of the epitopes. The selected features also include 65 that are based on sliding windows inside the 20-mers. This shows that the use of sliding windows, which is proposed in this work, is beneficial when compared to the use of the entire 20-mer. Moreover, 44 of the selected features use information coming from multiple types of inputs, which points to the importance of the novel aspects introduced in this work. Finally, 21 features utilize information coming from the conservation scores, which indicates that this input, which we also introduced here, provides a valuable contribution.</p>
<p>We again parameterize the SVM model using the same grid search with the selected features. The selected parameters are
<italic>C</italic>
 = 2
<sup>0</sup>
and
<italic>gamma</italic>
 = 2
<sup>−9</sup>
, and we used these parameters to implement our BEST method and to perform predictions on all considered datasets.</p>
</sec>
<sec id="s2f">
<title>Calculation of propensity scores</title>
<p>The real-value outputs generated by the SVM model, which are calculated for the overlapping 20-mers extracted from the input protein chain and which approximate the probability of a given 20-mer to be a B-cell epitope, are used to calculate the propensity of each AA to form a B-cell epitope. We assign the same SVM score to every AA in a given 20-mer, which means that every AA in the input chain has between 1 (for the residues at either terminus) and 20 (for residues 20 or more positions away from a terminus) SVM scores assigned to it; these scores come from the overlapping 20-mers. We consider four schemes to calculate the propensity from these scores:</p>
<list list-type="bullet">
<list-item>
<p>
<italic>max scheme</italic>
in which we use the maximal score as the propensity. This scheme assumes that a given AA is likely to be an epitope if it was predicted as such (has a high SVM score) in even one 20-mer that includes it.</p>
</list-item>
<list-item>
<p>
<italic>average scheme</italic>
in which we use an average score. In this case, we implement a consensus-like decision where the propensity is based on all corresponding scores generated by the SVM.</p>
</list-item>
<list-item>
<p>
<italic>median scheme</italic>
in which we use a median score. This is again a consensus-like prediction but in this case we use one of the SVM scores, instead of calculating a new average value.</p>
</list-item>
<list-item>
<p>
<italic>distance scheme</italic>
where we calculate an average score but considering only a subset of the SVM scores. This is a novel approach in which we use only higher quality SVM scores. We note that the predictions associated with either low or high scores are usually more accurate compared with the predictions that have scores close to 0.5, which is the cutoff to separate the two outcomes; the 20-mers with scores <0.5 and >0.5 are assumed not to be epitopes and to be epitopes, correspondingly. This was shown for related SVM-based predictors
<xref rid="pone.0040104-Chen4" ref-type="bibr">[48]</xref>
,
<xref rid="pone.0040104-Zhang2" ref-type="bibr">[49]</xref>
. Therefore, we use only
<italic>k</italic>
 = 1,2, …,20 scores that are the farthest from 0.5 to compute the average; for
<italic>k</italic>
 = 20 this is equivalent to computing the average-scheme. We estimate the best value of
<italic>k</italic>
empirically; see Section “Selection of the method to calculate propensity scores”.</p>
</list-item>
</list>
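<p>The four schemes listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the BEST source code: the function names are hypothetical, the chain is assumed to contain at least one full window, and for the median scheme we assume the lower of the two middle scores is taken when the number of scores is even, so that the result is always one of the SVM scores.</p>

```python
from statistics import mean, median_low
from typing import List, Sequence


def scores_per_residue(window_scores: Sequence[float], window: int = 20) -> List[List[float]]:
    """Collect, for every residue, the scores of all windows that cover it.

    window_scores[i] is the SVM score of the window starting at residue i.
    """
    chain_length = len(window_scores) + window - 1
    collected: List[List[float]] = [[] for _ in range(chain_length)]
    for start, score in enumerate(window_scores):
        for position in range(start, start + window):
            collected[position].append(score)
    return collected


def propensity(scores: Sequence[float], scheme: str = "distance",
               k: int = 16, cutoff: float = 0.5) -> float:
    """Combine the scores assigned to one residue into a single propensity."""
    if scheme == "max":
        return max(scores)
    if scheme == "average":
        return mean(scores)
    if scheme == "median":
        # Assumption: lower middle value, so the result is an actual score.
        return median_low(scores)
    if scheme == "distance":
        # Average only the k scores that lie farthest from the cutoff.
        ranked = sorted(scores, key=lambda s: abs(s - cutoff), reverse=True)
        return mean(ranked[:k])
    raise ValueError("unknown scheme: " + scheme)
```

<p>Residues near a terminus have fewer than k scores, in which case the slice simply uses all available scores and the distance scheme reduces to the average scheme for those residues.</p>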
</sec>
</sec>
<sec id="s3">
<title>Results</title>
<sec id="s3a">
<title>Comparison on the fragment-based datasets</title>
<p>We evaluate the results generated by our SVM models, using both the model with all 198 features and the model with the selected 84 features, on two benchmark fragment-based datasets: BCPREDFrag and ChenFrag. These datasets include 20-mers of epitopes and non-epitopes, which were generated by truncation-and-extension. We compare our predictions with the results of recent predictors, including the method by Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
, BCPred
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
, and COBEpro
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.
<xref ref-type="table" rid="pone-0040104-t002">Table 2</xref>
summarizes the results based on the 10-fold cross validation on the BCPREDFrag dataset, while
<xref ref-type="table" rid="pone-0040104-t003">Table 3</xref>
shows results on the ChenFrag dataset; we use 10-fold or 5-fold cross validation to mimic the tests from the original papers.
<xref ref-type="table" rid="pone-0040104-t002">Table 2</xref>
indicates that our SVM model with 198 features achieves an AUC of 0.81, accuracy of 74.5% and MCC of 0.53 on the BCPREDFrag dataset. The model with the selected 84 features achieves similar predictive quality, with AUC, accuracy, and MCC of 0.81, 74.0% and 0.55, respectively. The same level of similarity between these two approaches is observed on the ChenFrag data set. This demonstrates that the reduction of the feature set does not worsen the overall quality of the prediction. We note that the model with more input features gives a better sensitivity as a trade-off for reduced specificity, which means that it predicts more native epitope fragments but with a higher number of false positives.</p>
<p>Compared with the other considered predictors, our SVM models achieve the best predictions with an AUC of 0.81 and 0.85 and the highest MCC of 0.55 and 0.63 on the BCPREDFrag and ChenFrag datasets, respectively. The second-best predictor, COBEpro, obtains an AUC of 0.77 and 0.83 and MCC of 0.45 and 0.59 on the BCPREDFrag and ChenFrag datasets, respectively. Our models are characterized by high specificity (they rarely mistake non-epitopes for epitopes) and by sensitivity that is similar to the sensitivity offered by existing methods. The sensitivity in the 0.5 to 0.6 range means that about 50 to 60% of native epitopes are correctly predicted. The high precision offered by our SVM model with 84 features means that virtually all of the predicted epitopes are in fact correct. This means that our SVM-based approach provides predictions that are conservative, i.e., it predicts a subset of native epitopes but with high quality. We observe that the results on the ChenFrag dataset are better than for the BCPREDFrag dataset. This is because the former dataset includes chains with higher similarity (with each other) when compared with the latter dataset.</p>
</sec>
<sec id="s3b">
<title>Improvements due to the inclusion of novel features</title>
<p>We analyze the impact of the novel aspects that were introduced in this study, including the new features and the fact that we effectively combine multiple features, including new and previously proposed features. We compare the results of our SVM-based model with 84 features with the results obtained when using the RAAP scale from Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
and the similarity measure introduced in
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
. To do that, we developed two SVM-based predictors that use the
<italic>avg_RAAP</italic>
feature (denoted as
<italic>RAAP model</italic>
) and the
<italic>max_similarity_epitope
<sub>1</sub>
</italic>
feature (
<italic>MaxSimilarity model</italic>
), respectively. These are the two best ranked features (see
<xref ref-type="supplementary-material" rid="pone.0040104.s001">Table S1</xref>
) that utilize the concepts introduced in these two works. These two models were parameterized on the training BCPREDFrag dataset in the same way as the SVM models proposed in this work. Consequently, these two models are the same as the proposed SVM model, except for the input features. The ROC curves of the three models on BCPREDFrag and ChenFrag datasets are shown in
<xref ref-type="fig" rid="pone-0040104-g002">Figure 2</xref>
.</p>
<p>We observe that our model provides higher sensitivity (TP-rate) for the entire range of FP-rates (FP-rate  = 1-specificity). The AUC values of the RAAP and MaxSimilarity models on the BCPREDFrag dataset are 0.73 and 0.72, respectively, compared to 0.81 achieved by our model with 84 features. Similarly, the two single feature-based models obtain AUC equal to 0.74 and 0.79 on the ChenFrag dataset, while we obtain 0.85 when using all 84 features. This is a relatively large increase by 100%*(0.81–0.73)/0.5 = 16% and by 100%*(0.85–0.79)/0.5 = 12% on the BCPREDFrag and ChenFrag datasets, respectively, given that AUC values range between 0.5 (for random predictions) and 1 (for perfect predictions). We attribute this increase to the use of novel features and the combination of the new and existing features that are implemented in our approach.</p>
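<p>The normalized improvement quoted above divides the AUC difference by the usable AUC range, from 0.5 for random predictions to 1 for perfect predictions. A small sketch of this arithmetic follows; the function name relative_auc_gain is hypothetical.</p>

```python
def relative_auc_gain(auc_new: float, auc_old: float,
                      random_auc: float = 0.5, perfect_auc: float = 1.0) -> float:
    """Percentage gain in AUC relative to the range between random and perfect."""
    return 100.0 * (auc_new - auc_old) / (perfect_auc - random_auc)


# Values taken from the text: 84-feature model versus the best single-feature model.
bcpredfrag_gain = round(relative_auc_gain(0.81, 0.73), 6)
chenfrag_gain = round(relative_auc_gain(0.85, 0.79), 6)
```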
<p>We also investigate contributions of individual feature groups, which are defined in
<xref ref-type="table" rid="pone-0040104-t001">Table 1</xref>
.
<xref ref-type="table" rid="pone-0040104-t004">Table 4</xref>
shows the AUC values when only the selected features in each of the considered feature groups are utilized. Almost all the considered feature groups lead to an AUC above 0.5, which means that these models are better than random and that the corresponding features contribute to the final model that fuses all these features; the only exception is the group of conservation score-based features, which on its own reaches an AUC of 0.5. Moreover, we observe that our approach to expand ideas from the prior works is beneficial. For instance, the use of the 7 selected similarity score-derived features leads to improvements when compared to using only the one
<italic>max_similarity_epitope
<sub>1</sub>
</italic>
feature, which is based on
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
; the corresponding AUC values are 0.76 vs. 0.72 on the BCPREDFrag dataset and 0.82 vs. 0.79 on the ChenFrag dataset. Also, the use of the combined set of 84 features results in higher AUCs compared to the best performing individual feature group. Specifically, the best performing similarity score-based group provides AUC values lower by 0.053 and 0.024 on the BCPREDFrag and ChenFrag dataset, respectively, when compared to our SVM that used 84 features.</p>
<p>We further analyze the similarity-based scores between the 20-mers from the BCPREDFrag dataset and the library of the epitope fragments, i.e., the
<italic>max_similarity_epitope
<sub>1</sub>
</italic>
feature. We plot the values of this feature (see
<xref ref-type="fig" rid="pone-0040104-g003">Figure 3</xref>
) separately for the native epitope (using black line) and non-epitope (gray line) fragments. The plots demonstrate, as expected, that native epitopes have overall substantially higher similarity with each other compared to the similarity between non-epitopes and epitopes. The mean and variance of the scores for the native epitopic fragments are 45.8 and 1455.7, respectively, while they are 16.4 and 13.9 for the non-epitopic fragments. However, about 300 native epitopic fragments have scores that are low (<20) and comparable to the scores for the non-epitopic fragments. These fragments cannot be correctly predicted using the similarity score alone. We note that there are only a few non-epitopic 20-mers that have high similarity to the epitopic fragments. This provides a potential explanation for the high specificity offered by our SVM model.</p>
</sec>
<sec id="s3c">
<title>Selection of the method to calculate propensity scores</title>
<p>We compare the predictive quality of the four considered methods (see section “Calculation of propensity scores”) that calculate the propensity of residues in a protein sequence to form a B-cell epitope based on scores predicted by our SVM model with 84 features using the sliding window of 20-mers. In other words, we chunk the input protein using a sliding window of size 20, process each window using our SVM model and combine the scores generated by the SVM using each of the four methods (
<italic>maximum, average, median</italic>
and
<italic>distance scheme</italic>
) to predict a full protein chain. First, we parameterize the
<italic>distance scheme</italic>
to select the number of scores,
<italic>k</italic>
, that will be used, see
<xref ref-type="fig" rid="pone-0040104-g004">Figure 4</xref>
. We perform the calculations on the SEQ194 dataset (we use the Filter40_BCPREDFrag to generate the SVM model) and we use AUC and success rate as the evaluation criteria. The results indicate that the predictive quality is higher when we choose
<italic>k</italic>
between 10 and 16. Using smaller
<italic>k</italic>
would remove some of the useful scores and using higher
<italic>k</italic>
would include too many scores which may include some poor quality predictions. We compare the
<italic>distance scheme</italic>
with
<italic>k</italic>
 = 10 and
<italic>k</italic>
 = 16 with the other three approaches in
<xref ref-type="table" rid="pone-0040104-t005">Table 5</xref>
. The use of the
<italic>median scheme</italic>
results in the highest success rate at 60.8% and the third-best AUC of 0.55. The application of the
<italic>distance scheme</italic>
with
<italic>k</italic>
 = 16 leads to the highest AUC, equal to 0.57, and the second-best success rate of 60.3%. Consequently, we select this
<italic>distance scheme</italic>
to compute the propensities and to implement our BEST method. Our predictor can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://biomine.ece.ualberta.ca/BEST/">http://biomine.ece.ualberta.ca/BEST/</ext-link>
.</p>
</sec>
<sec id="s3d">
<title>Comparison on the sequence-based datasets</title>
<p>We compare our BEST method, which uses the SVM model with 84 features generated with the Filtered40_BCPREDFrag dataset and the distance scheme with
<italic>k</italic>
 = 16, with recent representative sequence-based predictors of B-cell epitopes, including ABCPred
<xref rid="pone.0040104-Saha1" ref-type="bibr">[19]</xref>
, COBEpro
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
, BayesB
<xref rid="pone.0040104-Wee1" ref-type="bibr">[23]</xref>
, and CBTOPE
<xref rid="pone.0040104-Ansari1" ref-type="bibr">[25]</xref>
. We also include the results from the structure-based predictor Epitopia
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
,
<xref rid="pone.0040104-Rubinstein2" ref-type="bibr">[35]</xref>
and the alternative version of our method that uses
<italic>k</italic>
 = 10. Since some methods only predict epitopic fragments in a protein chain, we computed the propensities for each amino acid as follows:</p>
<list list-type="bullet">
<list-item>
<p>For Epitopia, we utilized the immunogenicity scores generated by the web server at
<ext-link ext-link-type="uri" xlink:href="http://epitopia.tau.ac.il/">http://epitopia.tau.ac.il/</ext-link>
, and we normalized them into the [0, 1] interval.</p>
</list-item>
<list-item>
<p>For ABCPred, we used the web server at
<ext-link ext-link-type="uri" xlink:href="http://www.imtech.res.in/raghava/abcpred/">http://www.imtech.res.in/raghava/abcpred/</ext-link>
with default parameters. The server returns predicted epitopic fragments with their scores. For a given residue, we used the maximal score from all fragments where this residue is included.</p>
</list-item>
<list-item>
<p>For COBEpro, we used the web server at
<ext-link ext-link-type="uri" xlink:href="http://scratch.proteomics.ics.uci.edu/">http://scratch.proteomics.ics.uci.edu/</ext-link>
and we followed the procedure from
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
.</p>
</list-item>
<list-item>
<p>For BayesB, we performed predictions based on the web server at
<ext-link ext-link-type="uri" xlink:href="http://www.immunopred.org/bayesb/index.html">http://www.immunopred.org/bayesb/index.html</ext-link>
. This method was designed to predict linear B-cell epitopes and it returns a list of predicted epitopes as 20-mers, with no scores. We assumed that a given residue is a B-cell epitope if it appears in at least one of the predicted 20-mers; otherwise, it is assumed not to be an epitope. We could not calculate AUC for BayesB since this method does not return scores.</p>
</list-item>
<list-item>
<p>For CBTOPE, we calculated the predictions with the web server at
<ext-link ext-link-type="uri" xlink:href="http://www.imtech.res.in/raghava/cbtope/">http://www.imtech.res.in/raghava/cbtope/</ext-link>
using default parameters. We divided the scores generated by the server, which are in the 0 to 9 range, by 10 to normalize them into the [0, 1] interval.</p>
</list-item>
</list>
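<p>The per-residue conversions described in the list above for ABCPred, BayesB and CBTOPE can be sketched as follows. The function names and the (start, length, score) fragment representation are hypothetical, starts are assumed to be 0-based, and we assume that residues not covered by any predicted fragment receive a score of 0. The Epitopia normalization is omitted because the text does not specify how it was done.</p>

```python
from typing import Iterable, List, Sequence, Tuple


def max_fragment_scores(chain_length: int,
                        fragments: Iterable[Tuple[int, int, float]]) -> List[float]:
    """ABCPred-style conversion: a residue gets the maximal score over
    all predicted fragments that include it (0.0 if none does)."""
    scores = [0.0] * chain_length
    for start, length, score in fragments:
        for position in range(start, min(start + length, chain_length)):
            scores[position] = max(scores[position], score)
    return scores


def binary_from_fragments(chain_length: int,
                          fragments: Iterable[Tuple[int, int]]) -> List[int]:
    """BayesB-style conversion: 1 if the residue appears in at least one
    predicted fragment, otherwise 0."""
    labels = [0] * chain_length
    for start, length in fragments:
        for position in range(start, min(start + length, chain_length)):
            labels[position] = 1
    return labels


def scale_cbtope(raw_scores: Sequence[int]) -> List[float]:
    """CBTOPE-style conversion: divide the 0 to 9 server scores by 10."""
    return [score / 10 for score in raw_scores]
```

<p>Because the BayesB conversion yields only binary labels, it supports a success rate but not an AUC, which matches the limitation noted in the list.</p>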
<p>The comparison is performed on the SEQ194 dataset, see
<xref ref-type="table" rid="pone-0040104-t006">Table 6</xref>
. For Epitopia, ABCPred and COBEpro we show the predictions that were generated with the author-provided web servers together with the results on the same dataset from
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
. We also evaluate the significance of the differences between our predictor and the other methods using their web server predictions. We select 100 chains at random from the SEQ194 dataset and repeat the evaluation 10 times using these subsets of sequences. We use a paired t-test to compare the resulting AUC values, and the differences are assumed significant if
<italic>p</italic>
-value <0.05. The corresponding average AUCs and their standard deviations are shown in
<xref ref-type="fig" rid="pone-0040104-g005">Figure 5</xref>
.</p>
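<p>The resampling and the paired comparison described above can be sketched as follows. This is a hedged illustration: the function names and the fixed random seed are assumptions, and the code computes only the paired t statistic. With 10 repetitions there are 9 degrees of freedom, for which the two-sided critical value at the 0.05 level is approximately 2.262; a p-value could instead be obtained from a statistics package, for example with scipy.stats.ttest_rel.</p>

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence


def subsample_indices(n_chains: int, subset_size: int = 100,
                      repeats: int = 10, seed: int = 0) -> List[List[int]]:
    """Draw `repeats` random subsets of chain indices, each without replacement."""
    rng = random.Random(seed)
    return [rng.sample(range(n_chains), subset_size) for _ in range(repeats)]


def paired_t_statistic(aucs_a: Sequence[float], aucs_b: Sequence[float]) -> float:
    """Paired t statistic for two methods evaluated on the same subsets.

    Assumes equal-length inputs whose differences are not all identical.
    """
    diffs = [a - b for a, b in zip(aucs_a, aucs_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

<p>Each method is evaluated on the same 10 subsets, so the AUC values are paired by subset before the statistic is computed.</p>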
<p>When compared with the sequence-based methods using
<xref ref-type="table" rid="pone-0040104-t006">Table 6</xref>
, BEST (which uses
<italic>k</italic>
 = 16) achieves the best AUC  = 0.57. The second-best ABCPred and COBEpro methods achieve AUC around 0.55. The improvements in AUC offered by BEST have moderate magnitude but these differences are significant when compared with all chain-based methods including ABCPred, CBTOPE, and COBEpro. The structure-based Epitopia outperforms our sequence-based approach and obtains AUC of about 0.57 (or 0.59 in the original paper). The corresponding ROC curves are shown in
<xref ref-type="fig" rid="pone-0040104-g006">Figure 6</xref>
. We note that BEST offers highest TP-rates (sensitivity) for higher FP-rates, while our SVM-based design with distance scheme with
<italic>k</italic>
 = 10 offers highest sensitivity for low FP-rates. Structure-based Epitopia is the only method that outperforms our SVM-based approaches for FP-rates above 0.6. However, BEST is outperformed by COBEpro, BayesB, ABCPred, and Epitopia when considering the success rates. We note that BayesB obtains a high success rate of 80.9%. However, this is a byproduct of the fact that this method substantially overpredicts epitopes; 97.6% of residues are predicted as epitopes by the BayesB method. We also compare with a “random” predictor, which uses a randomly generated score between 0 and 1 for each 20-mer fragment and which calculates the propensity scores using the distance scheme with
<italic>k</italic>
 = 16. When evaluated with AUC, the random method is significantly worse than our BEST (
<italic>p</italic>
-value  = 5.5*10
<sup>−8</sup>
).</p>
<p>We also perform a second test on the SEQ19 dataset. This dataset is arguably too small to assess statistical significance, but it allows gauging the overall predictive quality. Our BEST method achieves AUC of 0.601, while ABCPred and COBEpro, which are the top two sequence-based runner-up methods on the SEQ194 dataset, obtain AUC of 0.541 and 0.525, respectively. The corresponding ROC curves are given in
<xref ref-type="fig" rid="pone-0040104-g007">Figure 7</xref>
and they show that BEST provides higher sensitivity (TP-rate) for the FP-rates below 0.8 when compared to the other two sequence-based predictors.</p>
</sec>
<sec id="s3e">
<title>Case studies</title>
<p>We present two case studies to visualize the propensity profiles generated by various considered B-cell epitope predictors. We selected two proteins from the SEQ194 dataset, a capsid protein (UniProt ID: P16489) with one short continuous epitope, and anti-repression transactivator protein (UniProt ID: P20869) that has a discontinuous B-cell epitope composed of two segments.
<xref ref-type="fig" rid="pone-0040104-g008">Figure 8</xref>
shows the propensities predicted by ABCPred, COBEpro, Epitopia and BEST together with the location of the native epitopes. The propensity profiles generated by BEST are smooth due to the averaging of the SVM scores, and the peaks denote predicted epitopes. BEST gives a peak around the location of the native epitope for the capsid protein and another peak in the vicinity of the N-terminus in that chain; the latter is a likely false positive prediction; see
<xref ref-type="fig" rid="pone-0040104-g008">Figure 8A</xref>
. For the anti-repression transactivator protein (see
<xref ref-type="fig" rid="pone-0040104-g008">Figure 8B</xref>
) our method correctly predicts the shorter of the two epitope segments and provides slightly elevated propensities for the longer segment. ABCPred identified the epitopes in the latter protein quite well, but it could not find the epitope in the capsid protein. COBEpro and Epitopia find the longer epitope fragment in the anti-repression transactivator and several (potentially) false positive epitopes in both proteins. We note that these results should not be assumed to be typical, i.e., to represent “average” predictive quality across these methods which is summarized in
<xref ref-type="table" rid="pone-0040104-t006">Table 6</xref>
; they are presented to contrast the overall characteristics of the propensity profiles generated by these methods.</p>
</sec>
</sec>
<sec id="s4">
<title>Discussion</title>
<p>We propose a new approach for the prediction of B-cell epitopes from antigen sequences. Our BEST method predicts epitopes from full protein chains using a novel approach based on averaging selected scores generated from 20-mers by an SVM-based predictor. We use a comprehensive and custom designed set of inputs that are generated by fusing information derived from the protein chain, similarity to known (training) epitopes, sequence conservation and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets (including an independent test set of 194 antigens) demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors including ABCPred
<xref rid="pone.0040104-Saha1" ref-type="bibr">[19]</xref>
, method by Chen et al.
<xref rid="pone.0040104-Chen2" ref-type="bibr">[20]</xref>
, BCPred
<xref rid="pone.0040104-ElManzalawy1" ref-type="bibr">[21]</xref>
, COBEpro
<xref rid="pone.0040104-Sweredoski1" ref-type="bibr">[22]</xref>
, BayesB
<xref rid="pone.0040104-Wee1" ref-type="bibr">[23]</xref>
, and CBTOPE
<xref rid="pone.0040104-Ansari1" ref-type="bibr">[25]</xref>
, when considering the predictions from full chains and also from the chain fragments. We show that the improvements came from the design and use of new inputs, which include conservation scores. These scores and other inputs were combined together to calculate fused features. These individual features combine information from multiple inputs, e.g., one feature fuses information from the predicted secondary structure, sequence and sequence conservation. We also present a couple of case studies to demonstrate the propensity profiles generated by BEST.</p>
<p>The predictive quality offered by our method can potentially be further improved. One possibility is to first use the antigen sequence to predict its fold, which would then be used as an input. This is motivated by the superior predictive performance of the structure-based predictors when compared to the sequence-based methods
<xref rid="pone.0040104-Yang1" ref-type="bibr">[3]</xref>
,
<xref rid="pone.0040104-Zhang1" ref-type="bibr">[31]</xref>
,
<xref rid="pone.0040104-Rubinstein1" ref-type="bibr">[34]</xref>
. The structure could also be approximated with the use of sequence-predicted structural characteristics, such as contact numbers or B-factors
<xref rid="pone.0040104-Kurgan1" ref-type="bibr">[50]</xref>
, which are utilized by some of the structure-based predictors
<xref rid="pone.0040104-HasteAndersen1" ref-type="bibr">[27]</xref>
,
<xref rid="pone.0040104-Liang1" ref-type="bibr">[30]</xref>
,
<xref rid="pone.0040104-Liu1" ref-type="bibr">[32]</xref>
. Another worthwhile input is disorder, and in particular molecular recognition features that are important for protein recognition
<xref rid="pone.0040104-Mohan1" ref-type="bibr">[51]</xref>
and which can be predicted from the sequence
<xref rid="pone.0040104-Mszros1" ref-type="bibr">[52]</xref>
,
<xref rid="pone.0040104-MiriDisfani1" ref-type="bibr">[53]</xref>
. However, the main limiting factor is the fact that only a small fraction (several thousand) of the epitopes is known and can be used to build predictive models compared to about a trillion antibodies in our body, when excluding T cell receptors
<xref rid="pone.0040104-Yang1" ref-type="bibr">[3]</xref>
. We believe that major improvements can be accomplished only when additional data becomes available.</p>
<p>BEST can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://biomine.ece.ualberta.ca/BEST/">http://biomine.ece.ualberta.ca/BEST/</ext-link>
.</p>
</sec>
<sec sec-type="supplementary-material" id="s5">
<title>Supporting Information</title>
<supplementary-material content-type="local-data" id="pone.0040104.s001">
<label>Table S1</label>
<caption>
<p>
<bold>List of the 84 selected features.</bold>
The features are sorted according to the average (over the ten training folds generated based on the 10 fold cross-validation on the training dataset) absolute biserial correlation coefficient (BCC).</p>
<p>(PDF)</p>
</caption>
<media xlink:href="pone.0040104.s001.pdf">
<caption>
<p>Click here for additional data file.</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>Dr. Kurgan gratefully acknowledges the support received during his visit at the Center for Computational Biology and Bioinformatics at the Indiana University School of Medicine. The authors thank Drs. G.P.S. Raghava, Ke Chen and Tuo Zhang for advice on using their software, and Mr. Nimrod Rubinstein and Dr. Hifzur Rahman Ansari for providing datasets. Fruitful discussions with Drs. Gang Hu and Kui Wang are gratefully acknowledged.</p>
</ack>
<fn-group>
<fn fn-type="COI-statement">
<p>
<bold>Competing Interests: </bold>
Co-author Lukasz Kurgan is a PLoS ONE Editorial Board member. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.</p>
</fn>
<fn fn-type="financial-disclosure">
<p>
<bold>Funding: </bold>
This work was supported by National Science Foundation of China (NSFC) grants 31050110432 and 31150110577 to LK and JR, National Institutes of Health grant GM R01 085003 to YZ, and by the Discovery grant 298328 from NSERC (Natural Sciences and Engineering Research Council) Canada to LK. JR was also supported by the International Development Research Center, Ottawa, Canada (No. 104519-010) and Tianjin science and technology support project 08ZCHHZ00200. JG was supported by the Fundamental Research Funds for the Central Universities grant 65011491. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="pone.0040104-Chen1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Rayner</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>KH</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Advances of bioinformatics tools applied in virus epitopes prediction.</article-title>
<source>Virol Sin</source>
<volume>26</volume>
<fpage>1</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">21331885</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Beck1">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beck</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Klinguer-Hamour</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Bussat</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Champion</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Haeuw</surname>
<given-names>JF</given-names>
</name>
<etal></etal>
</person-group>
<year>2007</year>
<article-title>Peptides as tools and drugs for immunotherapies.</article-title>
<source>J Pept Sci</source>
<volume>13</volume>
<fpage>588</fpage>
<lpage>602</lpage>
<pub-id pub-id-type="pmid">17602441</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Yang1">
<label>3</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>X</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>An introduction to epitope prediction methods and software.</article-title>
<source>Rev Med Virol</source>
<volume>19</volume>
<fpage>77</fpage>
<lpage>96</lpage>
<pub-id pub-id-type="pmid">19101924</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Tong1">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tong</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>TW</given-names>
</name>
<name>
<surname>Ranganathan</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Methods and protocols for prediction of immunogenic epitopes.</article-title>
<source>Brief Bioinform</source>
<volume>8</volume>
<fpage>96</fpage>
<lpage>108</lpage>
<pub-id pub-id-type="pmid">17077136</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Blythe1">
<label>5</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blythe</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Flower</surname>
<given-names>DR</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Benchmarking B cell epitope prediction: underperformance of existing methods.</article-title>
<source>Protein Sci</source>
<volume>14</volume>
<fpage>246</fpage>
<lpage>248</lpage>
<pub-id pub-id-type="pmid">15576553</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Pellequer1">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pellequer</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Westhof</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Van Regenmortel</surname>
<given-names>MH</given-names>
</name>
</person-group>
<year>1991</year>
<article-title>Predicting location of continuous epitopes in proteins from their primary structures.</article-title>
<source>Methods Enzymol</source>
<volume>203</volume>
<fpage>176</fpage>
<lpage>201</lpage>
<pub-id pub-id-type="pmid">1722270</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Flower1">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Flower</surname>
<given-names>DR</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Immunoinformatics and the in silico prediction of immunogenicity. An introduction.</article-title>
<source>Methods Mol Biol</source>
<volume>409</volume>
<fpage>1</fpage>
<lpage>15</lpage>
<pub-id pub-id-type="pmid">18449989</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Hopp1">
<label>8</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hopp</surname>
<given-names>TP</given-names>
</name>
<name>
<surname>Woods</surname>
<given-names>KR</given-names>
</name>
</person-group>
<year>1981</year>
<article-title>Prediction of protein antigenic determinants from amino acid sequences.</article-title>
<source>Proc Natl Acad Sci</source>
<volume>78</volume>
<fpage>3824</fpage>
<lpage>3828</lpage>
<pub-id pub-id-type="pmid">6167991</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Welling1">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Welling</surname>
<given-names>GW</given-names>
</name>
<name>
<surname>Weijer</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>van der Zee</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Welling-Wester</surname>
<given-names>S</given-names>
</name>
</person-group>
<year>1985</year>
<article-title>Prediction of sequential antigenic regions in proteins.</article-title>
<source>FEBS Lett</source>
<volume>188</volume>
<fpage>215</fpage>
<lpage>218</lpage>
<pub-id pub-id-type="pmid">2411595</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Karplus1">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karplus</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Schulz</surname>
<given-names>GE</given-names>
</name>
</person-group>
<year>1985</year>
<article-title>Prediction of chain flexibility in proteins: a tool for the selection of peptide antigen.</article-title>
<source>Naturwissenschaften</source>
<volume>72</volume>
<fpage>212</fpage>
<lpage>213</lpage>
</element-citation>
</ref>
<ref id="pone.0040104-Parker1">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Parker</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hodges</surname>
<given-names>RS</given-names>
</name>
</person-group>
<year>1986</year>
<article-title>New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: Correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites.</article-title>
<source>Biochemistry</source>
<volume>25</volume>
<fpage>5425</fpage>
<lpage>5432</lpage>
<pub-id pub-id-type="pmid">2430611</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Kolaskar1">
<label>12</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kolaskar</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Tongaonkar</surname>
<given-names>PC</given-names>
</name>
</person-group>
<year>1990</year>
<article-title>A semi empirical method for prediction of antigenic determinants on protein antigens.</article-title>
<source>FEBS Lett</source>
<volume>276</volume>
<fpage>172</fpage>
<lpage>174</lpage>
<pub-id pub-id-type="pmid">1702393</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Pellequer2">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pellequer</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Westhof</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Van Regenmortel</surname>
<given-names>MH</given-names>
</name>
</person-group>
<year>1993</year>
<article-title>Correlation between the location of antigenic sites and the prediction of turns in proteins.</article-title>
<source>Immunol Lett</source>
<volume>36</volume>
<fpage>83</fpage>
<lpage>99</lpage>
<pub-id pub-id-type="pmid">7688347</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Pellequer3">
<label>14</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Pellequer</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Westhof</surname>
<given-names>E</given-names>
</name>
</person-group>
<year>1993</year>
<article-title>PREDITOP: a program for antigenicity prediction.</article-title>
<source>J Mol Graph</source>
<volume>11</volume>
<fpage>204</fpage>
<lpage>210</lpage>
<page-range>204–210, 191–192</page-range>
</element-citation>
</ref>
<ref id="pone.0040104-Alix1">
<label>15</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alix</surname>
<given-names>AJ</given-names>
</name>
</person-group>
<year>1999</year>
<article-title>Predictive estimation of protein linear epitopes by using the program PEOPLE.</article-title>
<source>Vaccine</source>
<volume>18</volume>
<fpage>311</fpage>
<lpage>314</lpage>
<pub-id pub-id-type="pmid">10506656</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Odorico1">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Odorico</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pellequer</surname>
<given-names>JL</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>BEPITOPE: predicting the location of continuous epitopes and patterns in proteins.</article-title>
<source>J Mol Recognit</source>
<volume>16</volume>
<fpage>20</fpage>
<lpage>22</lpage>
<pub-id pub-id-type="pmid">12557235</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Larsen1">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Larsen</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Lund</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>M</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Improved method for predicting linear B-cell epitopes.</article-title>
<source>Immunome Res</source>
<volume>2</volume>
<fpage>2</fpage>
<pub-id pub-id-type="pmid">16635264</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Sllner1">
<label>18</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Söllner</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Mayer</surname>
<given-names>B</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Machine learning approaches for prediction of linear B-cell epitopes on proteins.</article-title>
<source>J Mol Recognit</source>
<volume>19</volume>
<fpage>200</fpage>
<lpage>208</lpage>
<pub-id pub-id-type="pmid">16598694</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Saha1">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saha</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GP</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Prediction of continuous B-cell epitopes in an antigen using recurrent neural network.</article-title>
<source>Proteins</source>
<volume>65</volume>
<fpage>40</fpage>
<lpage>48</lpage>
</element-citation>
</ref>
<ref id="pone.0040104-Chen2">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Prediction of linear B-cell epitopes using amino acid pair antigenicity scale.</article-title>
<source>Amino Acids</source>
<volume>33</volume>
<fpage>423</fpage>
<lpage>428</lpage>
<pub-id pub-id-type="pmid">17252308</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-ElManzalawy1">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>El-Manzalawy</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Dobbs</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Honavar</surname>
<given-names>V</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>Predicting linear B-cell epitopes using string kernels.</article-title>
<source>J Mol Recognit</source>
<volume>21</volume>
<fpage>243</fpage>
<lpage>255</lpage>
<pub-id pub-id-type="pmid">18496882</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Sweredoski1">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sweredoski</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Baldi</surname>
<given-names>P</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>COBEpro: a novel system for predicting continuous B-cell epitopes.</article-title>
<source>Protein Eng Des Sel</source>
<volume>22</volume>
<fpage>113</fpage>
<lpage>120</lpage>
<pub-id pub-id-type="pmid">19074155</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Wee1">
<label>23</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wee</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Simarmata</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kam</surname>
<given-names>YW</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>LF</given-names>
</name>
<name>
<surname>Tong</surname>
<given-names>JC</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>SVM-based prediction of linear B-cell epitopes using Bayes feature extraction.</article-title>
<source>BMC Genomics</source>
<volume>11</volume>
<fpage>S21</fpage>
<pub-id pub-id-type="pmid">21143805</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Altschul1">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schäffer</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<etal></etal>
</person-group>
<year>1997</year>
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</article-title>
<source>Nucleic Acids Res</source>
<volume>25</volume>
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="pmid">9254694</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Ansari1">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ansari</surname>
<given-names>HR</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GP</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>Identification of conformational B-cell Epitopes in an antigen from its primary sequence.</article-title>
<source>Immunome Res</source>
<volume>6</volume>
<fpage>6</fpage>
<pub-id pub-id-type="pmid">20961417</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-KulkarniKale1">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kulkarni-Kale</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Bhosle</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kolaskar</surname>
<given-names>AS</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>CEP: a conformational epitope prediction server.</article-title>
<source>Nucleic Acids Res</source>
<volume>33</volume>
<fpage>W168</fpage>
<lpage>W171</lpage>
<pub-id pub-id-type="pmid">15980448</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-HasteAndersen1">
<label>27</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haste Andersen</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lund</surname>
<given-names>O</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Prediction of residues in discontinuous B-cell epitopes using protein 3D structures.</article-title>
<source>Protein Sci</source>
<volume>15</volume>
<fpage>2558</fpage>
<lpage>2567</lpage>
<pub-id pub-id-type="pmid">17001032</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Sun1">
<label>28</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<year>2009</year>
<article-title>SEPPA: a computational server for spatial epitope prediction of protein antigens.</article-title>
<source>Nucleic Acids Res</source>
<volume>37</volume>
<issue>Web Server issue</issue>
<fpage>W612</fpage>
<lpage>616</lpage>
</element-citation>
</ref>
<ref id="pone.0040104-Sweredoski2">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sweredoski</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Baldi</surname>
<given-names>P</given-names>
</name>
</person-group>
<year>2008</year>
<article-title>PEPITO: improved discontinuous B-cell epitope prediction using multiple distance thresholds and half sphere exposure.</article-title>
<source>Bioinformatics</source>
<volume>24</volume>
<fpage>1459</fpage>
<lpage>1460</lpage>
<pub-id pub-id-type="pmid">18443018</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Liang1">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liang</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Standley</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zacharias</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<year>2010</year>
<article-title>EPSVR and EPMeta: prediction of antigenic epitopes using support vector regression and multiple server results.</article-title>
<source>BMC Bioinformatics</source>
<volume>11</volume>
<fpage>381</fpage>
<pub-id pub-id-type="pmid">20637083</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Zhang1">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<year>2011</year>
<article-title>Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature.</article-title>
<source>BMC Bioinformatics</source>
<volume>12</volume>
<fpage>341</fpage>
<pub-id pub-id-type="pmid">21846404</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Liu1">
<label>32</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Prediction of discontinuous B-cell epitopes using logistic regression and structural information.</article-title>
<source>J Proteomics Bioinform</source>
<volume>4</volume>
<fpage>010</fpage>
<lpage>015</lpage>
</element-citation>
</ref>
<ref id="pone.0040104-Zhao1">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
</person-group>
<year>2010</year>
<article-title>Mining for the antibody-antigen interacting associations that predict the B cell epitopes.</article-title>
<source>BMC Struct Biol</source>
<volume>10</volume>
<fpage>S6</fpage>
<pub-id pub-id-type="pmid">20487513</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Rubinstein1">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rubinstein</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Mayrose</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Martz</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Pupko</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Epitopia: a web-server for predicting B-cell epitopes.</article-title>
<source>BMC Bioinformatics</source>
<volume>10</volume>
<fpage>287</fpage>
<pub-id pub-id-type="pmid">19751513</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Rubinstein2">
<label>35</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rubinstein</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Mayrose</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Pupko</surname>
<given-names>T</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>A machine-learning approach for predicting B-cell epitopes.</article-title>
<source>Mol Immunol</source>
<volume>46</volume>
<fpage>840</fpage>
<lpage>847</lpage>
<pub-id pub-id-type="pmid">18947876</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Faraggi1">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Faraggi</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Xue</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network.</article-title>
<source>Proteins</source>
<volume>74</volume>
<fpage>847</fpage>
<lpage>856</lpage>
<pub-id pub-id-type="pmid">18704931</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Dor1">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dor</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y</given-names>
</name>
</person-group>
<year>2007</year>
<article-title>Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training.</article-title>
<source>Proteins</source>
<volume>66</volume>
<fpage>838</fpage>
<lpage>845</lpage>
<pub-id pub-id-type="pmid">17177203</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Saha2">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saha</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bhasin</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Raghava</surname>
<given-names>GP</given-names>
</name>
</person-group>
<year>2005</year>
<article-title>Bcipep: a database of B-cell epitopes.</article-title>
<source>BMC Genomics</source>
<volume>6</volume>
<fpage>79</fpage>
<pub-id pub-id-type="pmid">15921533</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Li1">
<label>39</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.</article-title>
<source>Bioinformatics</source>
<volume>22</volume>
<fpage>1658</fpage>
<lpage>1659</lpage>
<pub-id pub-id-type="pmid">16731699</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Huang1">
<label>40</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Honda</surname>
<given-names>W</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>CED: a conformational epitope database.</article-title>
<source>BMC Immunol</source>
<volume>7</volume>
<fpage>7</fpage>
<pub-id pub-id-type="pmid">16603068</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Ahmad1">
<label>41</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ahmad</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gromiha</surname>
<given-names>MM</given-names>
</name>
<name>
<surname>Sarai</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2003</year>
<article-title>Real value prediction of solvent accessibility from amino acid sequence.</article-title>
<source>Proteins</source>
<volume>50</volume>
<fpage>629</fpage>
<lpage>635</lpage>
<pub-id pub-id-type="pmid">12577269</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Ahmad2">
<label>42</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ahmad</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gromiha</surname>
<given-names>MM</given-names>
</name>
<name>
<surname>Sarai</surname>
<given-names>A</given-names>
</name>
</person-group>
<year>2004</year>
<article-title>Analysis and prediction of DNA binding proteins and their binding residues based on composition, sequence and structural information.</article-title>
<source>Bioinformatics</source>
<volume>20</volume>
<fpage>477</fpage>
<lpage>486</lpage>
<pub-id pub-id-type="pmid">14990443</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Wang1">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Samudrala</surname>
<given-names>R</given-names>
</name>
</person-group>
<year>2006</year>
<article-title>Incorporating background frequency improves entropy-based residue conservation measures.</article-title>
<source>BMC Bioinformatics</source>
<volume>7</volume>
<fpage>385</fpage>
<pub-id pub-id-type="pmid">16916457</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Mizianty1">
<label>44</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mizianty</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Kurgan</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences.</article-title>
<source>BMC Bioinformatics</source>
<volume>10</volume>
<fpage>414</fpage>
<pub-id pub-id-type="pmid">20003388</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Chen3">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Mizianty</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Kurgan</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>ATPsite: sequence-based prediction of ATP-binding residues.</article-title>
<source>Proteome Sci</source>
<volume>9</volume>
<fpage>S4</fpage>
<pub-id pub-id-type="pmid">22165846</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Mizianty2">
<label>46</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mizianty</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Stach</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Kedarisetti</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Disfani</surname>
<given-names>FM</given-names>
</name>
<etal></etal>
</person-group>
<year>2010</year>
<article-title>Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources.</article-title>
<source>Bioinformatics</source>
<volume>26</volume>
<fpage>i489</fpage>
<lpage>496</lpage>
<pub-id pub-id-type="pmid">20823312</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Mizianty3">
<label>47</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mizianty</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Kurgan</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Sequence-based prediction of protein crystallization, purification and production propensity.</article-title>
<source>Bioinformatics</source>
<volume>27</volume>
<fpage>i24</fpage>
<lpage>33</lpage>
<pub-id pub-id-type="pmid">21685077</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Chen4">
<label>48</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Mizianty</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Kurgan</surname>
<given-names>L</given-names>
</name>
</person-group>
<year>2012</year>
<article-title>Prediction and analysis of nucleotide binding residues using sequence and sequence-derived structural descriptors.</article-title>
<source>Bioinformatics</source>
<volume>28</volume>
<issue>3</issue>
<fpage>331</fpage>
<lpage>341</lpage>
<pub-id pub-id-type="pmid">22130595</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Zhang2">
<label>49</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<year>2008</year>
<article-title>Accurate sequence-based prediction of catalytic residues.</article-title>
<source>Bioinformatics</source>
<volume>24</volume>
<fpage>2329</fpage>
<lpage>2338</lpage>
<pub-id pub-id-type="pmid">18710875</pub-id>
</element-citation>
</ref>
<ref id="pone.0040104-Kurgan1">
<label>50</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kurgan</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Miri Disfani</surname>
<given-names>F</given-names>
</name>
</person-group>
<year>2011</year>
<article-title>Structural protein descriptors in 1-dimension and their sequence-based predictions.</article-title>
<source>Curr Protein Pept Sci</source>
<volume>12</volume>
<issue>6</issue>
<fpage>470</fpage>
<lpage>489</lpage>
</element-citation>
</ref>
<ref id="pone.0040104-Mohan1">
<label>51</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mohan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Oldfield</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Radivojac</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Vacic</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Cortese</surname>
<given-names>MS</given-names>
</name>
<etal></etal>
</person-group>
<year>2006</year>
<article-title>Analysis of molecular recognition features (MoRFs).</article-title>
<source>J Mol Biol</source>
<volume>362</volume>
<issue>5</issue>
<fpage>1043</fpage>
<lpage>1059</lpage>
</element-citation>
</ref>
<ref id="pone.0040104-Mszros1">
<label>52</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mészáros</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Simon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Dosztányi</surname>
<given-names>Z</given-names>
</name>
</person-group>
<year>2009</year>
<article-title>Prediction of protein binding regions in disordered proteins.</article-title>
<source>PLoS Comput Biol</source>
<volume>5</volume>
<issue>5</issue>
<fpage>e1000376</fpage>
</element-citation>
</ref>
<ref id="pone.0040104-MiriDisfani1">
<label>53</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Miri Disfani</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>W-L</given-names>
</name>
<name>
<surname>Mizianty</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Oldfield</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Xue</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<year>2012</year>
<article-title>MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins.</article-title>
<source>Bioinformatics, in print</source>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001069 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 001069 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3384636
   |texte=   BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:22761950" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 


This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021