Telematics exploration server

Warning: this site is under development!
Warning: this site is generated by computational means from raw corpora; the information is therefore not validated.

Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach

Internal identifier: 000039 (Pmc/Corpus); previous: 000038; next: 000040


Authors: Marcos Antonio Mouriño García; Roberto Pérez Rodríguez; Luis E. Anido Rifón

Source:

RBID: PMC:4592155

Abstract

Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important application for automatic document classification strategies. Biomedical staff and researchers have to deal with a large volume of literature in their daily activities, so a system that allows them to access documents of interest in a simple and effective way would be useful; thus, these documents need to be sorted according to some criterion, that is to say, they have to be classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are words in the text, thus suffering from synonymy and polysemy, and their weights are based solely on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, concretely Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a "unit of meaning" and thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with one of the corpora commonly used for evaluating classification and retrieval of biomedical information, OHSUMED, as well as with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem for the UVigoMED corpus.


Url:
DOI: 10.7717/peerj.1279
PubMed: 26468436
PubMed Central: 4592155

Links to Exploration step

PMC:4592155

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach</title>
<author>
<name sortKey="Mouri O Garcia, Marcos Antonio" sort="Mouri O Garcia, Marcos Antonio" uniqKey="Mouri O Garcia M" first="Marcos Antonio" last="Mouri O García">Marcos Antonio Mouri O García</name>
<affiliation>
<nlm:aff id="aff-1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Perez Rodriguez, Roberto" sort="Perez Rodriguez, Roberto" uniqKey="Perez Rodriguez R" first="Roberto" last="Pérez Rodríguez">Roberto Pérez Rodríguez</name>
<affiliation>
<nlm:aff id="aff-1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Anido Rif N, Luis E" sort="Anido Rif N, Luis E" uniqKey="Anido Rif N L" first="Luis E." last="Anido Rif N">Luis E. Anido Rif N</name>
<affiliation>
<nlm:aff id="aff-1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26468436</idno>
<idno type="pmc">4592155</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4592155</idno>
<idno type="RBID">PMC:4592155</idno>
<idno type="doi">10.7717/peerj.1279</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000039</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000039</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach</title>
<author>
<name sortKey="Mouri O Garcia, Marcos Antonio" sort="Mouri O Garcia, Marcos Antonio" uniqKey="Mouri O Garcia M" first="Marcos Antonio" last="Mouri O García">Marcos Antonio Mouri O García</name>
<affiliation>
<nlm:aff id="aff-1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Perez Rodriguez, Roberto" sort="Perez Rodriguez, Roberto" uniqKey="Perez Rodriguez R" first="Roberto" last="Pérez Rodríguez">Roberto Pérez Rodríguez</name>
<affiliation>
<nlm:aff id="aff-1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Anido Rif N, Luis E" sort="Anido Rif N, Luis E" uniqKey="Anido Rif N L" first="Luis E." last="Anido Rif N">Luis E. Anido Rif N</name>
<affiliation>
<nlm:aff id="aff-1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PeerJ</title>
<idno type="eISSN">2167-8359</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important application for automatic document classification strategies. Biomedical staff and researchers have to deal with a large volume of literature in their daily activities, so a system that allows them to access documents of interest in a simple and effective way would be useful; thus, these documents need to be sorted according to some criterion, that is to say, they have to be classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are words in the text, thus suffering from synonymy and polysemy, and their weights are based solely on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, concretely Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning” and thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with one of the corpora commonly used for evaluating classification and retrieval of biomedical information, OHSUMED, as well as with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem for the UVigoMED corpus.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Aronson, Ar" uniqKey="Aronson A">AR Aronson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blei, Dm" uniqKey="Blei D">DM Blei</name>
</author>
<author>
<name sortKey="Ng, Ay" uniqKey="Ng A">AY Ng</name>
</author>
<author>
<name sortKey="Jordan, Mi" uniqKey="Jordan M">MI Jordan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blizard, Wd" uniqKey="Blizard W">WD Blizard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bloehdorn, S" uniqKey="Bloehdorn S">S Bloehdorn</name>
</author>
<author>
<name sortKey="Hotho, A" uniqKey="Hotho A">A Hotho</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bodenreider, O" uniqKey="Bodenreider O">O Bodenreider</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dai, M" uniqKey="Dai M">M Dai</name>
</author>
<author>
<name sortKey="Shah, Nh" uniqKey="Shah N">NH Shah</name>
</author>
<author>
<name sortKey="Xuan, W" uniqKey="Xuan W">W Xuan</name>
</author>
<author>
<name sortKey="Musen, Ma" uniqKey="Musen M">MA Musen</name>
</author>
<author>
<name sortKey="Watson, Sj" uniqKey="Watson S">SJ Watson</name>
</author>
<author>
<name sortKey="Athey, Bd" uniqKey="Athey B">BD Athey</name>
</author>
<author>
<name sortKey="Meng, F" uniqKey="Meng F">F Meng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deerwester, S" uniqKey="Deerwester S">S Deerwester</name>
</author>
<author>
<name sortKey="Dumais, St" uniqKey="Dumais S">ST Dumais</name>
</author>
<author>
<name sortKey="Furnas, Gw" uniqKey="Furnas G">GW Furnas</name>
</author>
<author>
<name sortKey="Landauer, Tk" uniqKey="Landauer T">TK Landauer</name>
</author>
<author>
<name sortKey="Harshman, R" uniqKey="Harshman R">R Harshman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Egozi, O" uniqKey="Egozi O">O Egozi</name>
</author>
<author>
<name sortKey="Markovitch, S" uniqKey="Markovitch S">S Markovitch</name>
</author>
<author>
<name sortKey="Gabrilovich, E" uniqKey="Gabrilovich E">E Gabrilovich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elkin, Pl" uniqKey="Elkin P">PL Elkin</name>
</author>
<author>
<name sortKey="Cimino, Jj" uniqKey="Cimino J">JJ Cimino</name>
</author>
<author>
<name sortKey="Lowe, Hj" uniqKey="Lowe H">HJ Lowe</name>
</author>
<author>
<name sortKey="Aronow, Db" uniqKey="Aronow D">DB Aronow</name>
</author>
<author>
<name sortKey="Payne, Th" uniqKey="Payne T">TH Payne</name>
</author>
<author>
<name sortKey="Pincetl, Ps" uniqKey="Pincetl P">PS Pincetl</name>
</author>
<author>
<name sortKey="Barnett, Go" uniqKey="Barnett G">GO Barnett</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gabrilovich, E" uniqKey="Gabrilovich E">E Gabrilovich</name>
</author>
<author>
<name sortKey="Markovitch, S" uniqKey="Markovitch S">S Markovitch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gabrilovich, E" uniqKey="Gabrilovich E">E Gabrilovich</name>
</author>
<author>
<name sortKey="Markovitch, S" uniqKey="Markovitch S">S Markovitch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Godbole, S" uniqKey="Godbole S">S Godbole</name>
</author>
<author>
<name sortKey="Sarawagi, S" uniqKey="Sarawagi S">S Sarawagi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Harris, Zs" uniqKey="Harris Z">ZS Harris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hearst, M" uniqKey="Hearst M">M Hearst</name>
</author>
<author>
<name sortKey="Dumais, S" uniqKey="Dumais S">S Dumais</name>
</author>
<author>
<name sortKey="Osman, E" uniqKey="Osman E">E Osman</name>
</author>
<author>
<name sortKey="Platt, J" uniqKey="Platt J">J Platt</name>
</author>
<author>
<name sortKey="Scholkopf, B" uniqKey="Scholkopf B">B Scholkopf</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, L" uniqKey="Huang L">L Huang</name>
</author>
<author>
<name sortKey="Milne, D" uniqKey="Milne D">D Milne</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Joachims, T" uniqKey="Joachims T">T Joachims</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jonquet, C" uniqKey="Jonquet C">C Jonquet</name>
</author>
<author>
<name sortKey="Shah, Nh" uniqKey="Shah N">NH Shah</name>
</author>
<author>
<name sortKey="Musen, Ma" uniqKey="Musen M">MA Musen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kang, N" uniqKey="Kang N">N Kang</name>
</author>
<author>
<name sortKey="Afzal, Z" uniqKey="Afzal Z">Z Afzal</name>
</author>
<author>
<name sortKey="Singh, B" uniqKey="Singh B">B Singh</name>
</author>
<author>
<name sortKey="Van Mulligen, Em" uniqKey="Van Mulligen E">EM Van Mulligen</name>
</author>
<author>
<name sortKey="Kors, Ja" uniqKey="Kors J">JA Kors</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, H" uniqKey="Kim H">H Kim</name>
</author>
<author>
<name sortKey="Howland, P" uniqKey="Howland P">P Howland</name>
</author>
<author>
<name sortKey="Park, H" uniqKey="Park H">H Park</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Landauer, Tk" uniqKey="Landauer T">TK Landauer</name>
</author>
<author>
<name sortKey="Dumais, St" uniqKey="Dumais S">ST Dumais</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Levelt, Wj" uniqKey="Levelt W">WJ Levelt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lipscomb, Ce" uniqKey="Lipscomb C">CE Lipscomb</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lowe, Hj" uniqKey="Lowe H">HJ Lowe</name>
</author>
<author>
<name sortKey="Barnett, Go" uniqKey="Barnett G">GO Barnett</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Medelyan, O" uniqKey="Medelyan O">O Medelyan</name>
</author>
<author>
<name sortKey="Witten, Ih" uniqKey="Witten I">IH Witten</name>
</author>
<author>
<name sortKey="Milne, D" uniqKey="Milne D">D Milne</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Milne, D" uniqKey="Milne D">D Milne</name>
</author>
<author>
<name sortKey="Witten, Ih" uniqKey="Witten I">IH Witten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pedregosa, F" uniqKey="Pedregosa F">F Pedregosa</name>
</author>
<author>
<name sortKey="Varoquaux, G" uniqKey="Varoquaux G">G Varoquaux</name>
</author>
<author>
<name sortKey="Gramfort, A" uniqKey="Gramfort A">A Gramfort</name>
</author>
<author>
<name sortKey="Michel, V" uniqKey="Michel V">V Michel</name>
</author>
<author>
<name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
<author>
<name sortKey="Grisel, O" uniqKey="Grisel O">O Grisel</name>
</author>
<author>
<name sortKey="Blondel, M" uniqKey="Blondel M">M Blondel</name>
</author>
<author>
<name sortKey="Prettenhofer, P" uniqKey="Prettenhofer P">P Prettenhofer</name>
</author>
<author>
<name sortKey="Weiss, R" uniqKey="Weiss R">R Weiss</name>
</author>
<author>
<name sortKey="Dubourg, V" uniqKey="Dubourg V">V Dubourg</name>
</author>
<author>
<name sortKey="Vanderplas, J" uniqKey="Vanderplas J">J Vanderplas</name>
</author>
<author>
<name sortKey="Passos, A" uniqKey="Passos A">A Passos</name>
</author>
<author>
<name sortKey="Cournapeau, D" uniqKey="Cournapeau D">D Cournapeau</name>
</author>
<author>
<name sortKey="Brucher, M" uniqKey="Brucher M">M Brucher</name>
</author>
<author>
<name sortKey="Perrot, M" uniqKey="Perrot M">M Perrot</name>
</author>
<author>
<name sortKey="Duchesnay, E" uniqKey="Duchesnay E">E Duchesnay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Phan, X H" uniqKey="Phan X">X-H Phan</name>
</author>
<author>
<name sortKey="Nguyen, L M" uniqKey="Nguyen L">L-M Nguyen</name>
</author>
<author>
<name sortKey="Horiguchi, S" uniqKey="Horiguchi S">S Horiguchi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Porter, Mf" uniqKey="Porter M">MF Porter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rigutini, L" uniqKey="Rigutini L">L Rigutini</name>
</author>
<author>
<name sortKey="Maggini, M" uniqKey="Maggini M">M Maggini</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sahlgren, M" uniqKey="Sahlgren M">M Sahlgren</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sahlgren, M" uniqKey="Sahlgren M">M Sahlgren</name>
</author>
<author>
<name sortKey="Coster, R" uniqKey="Coster R">R Cöster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G Salton</name>
</author>
<author>
<name sortKey="Wong, A" uniqKey="Wong A">A Wong</name>
</author>
<author>
<name sortKey="Yang, Cs" uniqKey="Yang C">CS Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schapire, Re" uniqKey="Schapire R">RE Schapire</name>
</author>
<author>
<name sortKey="Singer, Y" uniqKey="Singer Y">Y Singer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sebastiani, F" uniqKey="Sebastiani F">F Sebastiani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Settles, B" uniqKey="Settles B">B Settles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stock, Wg" uniqKey="Stock W">WG Stock</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="T Ckstrom, O" uniqKey="T Ckstrom O">O Täckström</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsao, Y" uniqKey="Tsao Y">Y Tsao</name>
</author>
<author>
<name sortKey="Chen, Ky" uniqKey="Chen K">KY Chen</name>
</author>
<author>
<name sortKey="Wang, Hm" uniqKey="Wang H">HM Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tsoumakas, G" uniqKey="Tsoumakas G">G Tsoumakas</name>
</author>
<author>
<name sortKey="Katakis, I" uniqKey="Katakis I">I Katakis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vivaldi, J" uniqKey="Vivaldi J">J Vivaldi</name>
</author>
<author>
<name sortKey="Rodriguez, H" uniqKey="Rodriguez H">H Rodríguez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, P" uniqKey="Wang P">P Wang</name>
</author>
<author>
<name sortKey="Hu, J" uniqKey="Hu J">J Hu</name>
</author>
<author>
<name sortKey="Zeng, H J" uniqKey="Zeng H">H-J Zeng</name>
</author>
<author>
<name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, P" uniqKey="Wang P">P Wang</name>
</author>
<author>
<name sortKey="Hu, J" uniqKey="Hu J">J Hu</name>
</author>
<author>
<name sortKey="Zeng, H J" uniqKey="Zeng H">H-J Zeng</name>
</author>
<author>
<name sortKey="Chen, L" uniqKey="Chen L">L Chen</name>
</author>
<author>
<name sortKey="Chen, Z" uniqKey="Chen Z">Z Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, Y" uniqKey="Yang Y">Y Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yetisgen Yildiz, M" uniqKey="Yetisgen Yildiz M">M Yetisgen-Yildiz</name>
</author>
<author>
<name sortKey="Pratt, W" uniqKey="Pratt W">W Pratt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Phan, X H" uniqKey="Phan X">X-H Phan</name>
</author>
<author>
<name sortKey="Horiguchi, S" uniqKey="Horiguchi S">S Horiguchi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zheng, B" uniqKey="Zheng B">B Zheng</name>
</author>
<author>
<name sortKey="Mclean, Dc" uniqKey="Mclean D">DC McLean</name>
</author>
<author>
<name sortKey="Lu, X" uniqKey="Lu X">X Lu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, X" uniqKey="Zhou X">X Zhou</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, X" uniqKey="Zhou X">X Zhou</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, X" uniqKey="Zhou X">X Zhou</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PeerJ</journal-id>
<journal-id journal-id-type="iso-abbrev">PeerJ</journal-id>
<journal-id journal-id-type="pmc">PeerJ</journal-id>
<journal-id journal-id-type="publisher-id">PeerJ</journal-id>
<journal-title-group>
<journal-title>PeerJ</journal-title>
</journal-title-group>
<issn pub-type="epub">2167-8359</issn>
<publisher>
<publisher-name>PeerJ Inc.</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26468436</article-id>
<article-id pub-id-type="pmc">4592155</article-id>
<article-id pub-id-type="publisher-id">1279</article-id>
<article-id pub-id-type="doi">10.7717/peerj.1279</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Bioinformatics</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Science and Medical Education</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Human-Computer Interaction</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Computational Science</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach</article-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name>
<surname>Mouriño García</surname>
<given-names>Marcos Antonio</given-names>
</name>
<xref ref-type="aff" rid="aff-1"></xref>
<email>marcosmourino@gmail.com</email>
</contrib>
<contrib id="author-2" contrib-type="author">
<name>
<surname>Pérez Rodríguez</surname>
<given-names>Roberto</given-names>
</name>
<xref ref-type="aff" rid="aff-1"></xref>
</contrib>
<contrib id="author-3" contrib-type="author">
<name>
<surname>Anido Rifón</surname>
<given-names>Luis E.</given-names>
</name>
<xref ref-type="aff" rid="aff-1"></xref>
</contrib>
<aff id="aff-1">
<institution>Department of Telematics Engineering, University of Vigo</institution>
,
<addr-line>Vigo</addr-line>
,
<country>Spain</country>
</aff>
</contrib-group>
<contrib-group>
<contrib id="editor-1" contrib-type="editor">
<name>
<surname>Perry</surname>
<given-names>George</given-names>
</name>
</contrib>
</contrib-group>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2015-09-29">
<day>29</day>
<month>9</month>
<year iso-8601-date="2015">2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>3</volume>
<elocation-id>e1279</elocation-id>
<history>
<date date-type="received" iso-8601-date="2015-08-03">
<day>3</day>
<month>8</month>
<year iso-8601-date="2015">2015</year>
</date>
<date date-type="accepted" iso-8601-date="2015-09-07">
<day>7</day>
<month>9</month>
<year iso-8601-date="2015">2015</year>
</date>
</history>
<permissions>
<copyright-statement>© 2015 Mouriño García et al.</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>Mouriño García et al.</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>
, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://peerj.com/articles/1279"></self-uri>
<abstract>
<p>Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important application for automatic document classification strategies. Biomedical staff and researchers have to deal with a large volume of literature in their daily activities, so a system that allows them to access documents of interest in a simple and effective way would be useful; thus, these documents need to be sorted according to some criterion, that is to say, they have to be classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm: features are words in the text, thus suffering from synonymy and polysemy, and their weights are based solely on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, concretely Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning” and thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with one of the corpora commonly used for evaluating classification and retrieval of biomedical information, OHSUMED, as well as with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem for the UVigoMED corpus.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Biomedical literature</kwd>
<kwd>Classification</kwd>
<kwd>Wikipedia</kwd>
<kwd>Encyclopedic knowledge</kwd>
<kwd>Document representation</kwd>
<kwd>Bag-of-concepts</kwd>
<kwd>Bag-of-words</kwd>
<kwd>OHSUMED</kwd>
</kwd-group>
<funding-group>
<award-group id="fund-1">
<funding-source>Galician Regional Government</funding-source>
<award-id>GRC2013-006</award-id>
</award-group>
<award-group id="fund-2">
<funding-source>REDPLIR (Red Gallega de Procesamiento del Lenguaje y Recuperación de Información)</funding-source>
<award-id>R2014/034</award-id>
</award-group>
<funding-statement>Research partially supported by the Galician Regional Government under project GRC2013-006 (Consolidation of Research Units) and through REDPLIR (Red Gallega de Procesamiento del Lenguaje y Recuperación de Información)—R2014/034. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
</article-meta>
</front>
<body>
<sec sec-type="intro">
<title>Introduction</title>
<p>The ability to automatically classify text documents into a predefined set of categories is extremely convenient. Examples include the classification of educational resources into subjects such as mathematics, science, or history; the classification of books into thematic areas; and the classification of news into sections such as economy, politics, or sports. Among these and other applications, the automatic classification of biomedical literature stands out as an important application of automatic text classification strategies. Medical staff, scientists, and biomedical researchers handle huge amounts of literature and biomedical information in their daily work, so a system that allows them to access documents of interest in a simple, effective, efficient, and quick way is necessary, saving time otherwise spent querying or searching for these documents. This implies the need to sort or rank documents according to some criterion, i.e., to classify them.</p>
<p>Classification is modelled as a supervised learning problem: first, the classifier is trained with a certain number of examples—documents whose category is known—and then, the algorithm is applied to another set of documents whose category is unknown (
<xref rid="ref-34" ref-type="bibr">Sebastiani, 2002</xref>
). There is a huge amount of classification algorithms, including
<italic>k</italic>
-Nearest Neighbor (KNN), Decision Tree (DT), Neural Networks, Bayes, and Support Vector Machines (SVM) (
<xref rid="ref-43" ref-type="bibr">Yang, 1999</xref>
).</p>
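For illustration, a minimal sketch of this train-then-predict workflow using Scikit-learn (the library the authors use later for classification); the toy documents, labels, and category names are hypothetical, and the Bayes variant stands in for any of the algorithms listed above:

    # Minimal supervised classification sketch: train on labelled documents,
    # then predict the category of unseen ones (toy data, names hypothetical).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["myocardial infarction and heart failure",
                  "melanoma is a malignant skin tumour",
                  "arrhythmia and cardiac arrest"]
    train_labels = ["cardiovascular", "neoplasms", "cardiovascular"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)        # training phase
    model = MultinomialNB().fit(X_train, train_labels)

    X_new = vectorizer.transform(["skin tumour case report"])  # category unknown
    print(model.predict(X_new))                           # -> ['neoplasms']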
<p>The functioning of classifiers is based on the application of Natural Language Processing (NLP) techniques to the documents to classify, so that a software agent can recognise which category a given document belongs to, based on some NLP feature contained in it, such as word occurrence frequency or the structure of the language used (
<xref rid="ref-35" ref-type="bibr">Settles, 2010</xref>
). Vector Space Model (VSM) (
<xref rid="ref-32" ref-type="bibr">Salton, Wong & Yang, 1975</xref>
) is the most frequently used representation, where each document within a collection is represented as a point in space, commonly weighted by the frequency of occurrence of words. When words are used as features, the model is known as bag-of-words, a bag (or multiset) being a set of elements that can occur more than once (
<xref rid="ref-3" ref-type="bibr">Blizard, 1988</xref>
). Under this representation, a document is therefore characterised by the set of words that appear in its text, each repeated as many times as it occurs.</p>
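The multiset idea can be illustrated in a few lines of Python (toy text, purely illustrative):

    # A bag (multiset): each word is kept with its number of occurrences.
    from collections import Counter

    text = "the heart pumps blood and the blood carries oxygen"
    bag_of_words = Counter(text.split())
    print(bag_of_words["blood"])   # 2: "blood" occurs twice in the text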
<p>Despite being one of the traditionally used representations in document classification tasks (
<xref rid="ref-37" ref-type="bibr">Täckström, 2005</xref>
), the BoW model is suboptimal, because it only accounts for word frequency in the documents, and ignores important semantic relationships between them (
<xref rid="ref-41" ref-type="bibr">Wang et al., 2008</xref>
). The main limitations of the BoW representation are redundancy, ambiguity, orthogonality, hyponymy and hypernymy problems, data sparseness, and word usage diversity. Redundancy means that synonyms are not unified (
<xref rid="ref-8" ref-type="bibr">Egozi, Markovitch & Gabrilovich, 2011</xref>
;
<xref rid="ref-15" ref-type="bibr">Huang & Milne, 2012</xref>
). For instance, in the BoW model, if a document that contains the word “tumour” was classified into the “cancer” category, this would not provide information to classify a document that contains the word “neoplasm”.</p>
<list list-type="simple" id="list-1">
<list-item>
<label></label>
<p>Ambiguity refers to the problem of polysemy—one word can have several meanings (
<xref rid="ref-37" ref-type="bibr">Täckström, 2005</xref>
;
<xref rid="ref-8" ref-type="bibr">Egozi, Markovitch & Gabrilovich, 2011</xref>
). For instance, in the BoW model, if a document that contains the word “tissue” was classified into the “human anatomy” category, errors may arise when classifying a document in which the word “tissue” refers to the
<italic>Triphosa dubitata</italic>
moth.</p>
</list-item>
<list-item>
<label></label>
<p>The orthogonality problem means that the semantic relatedness between words is not taken into account (
<xref rid="ref-15" ref-type="bibr">Huang & Milne, 2012</xref>
). For example, knowing that a document that contains “cardiovascular system” was classified under the label “circulatory system” would not give information on how to classify a document that contains the word “blood”.</p>
</list-item>
<list-item>
<label></label>
<p>The hyponymy and hypernymy problem means that hierarchical relations are not leveraged (
<xref rid="ref-21" ref-type="bibr">Levelt, 1993</xref>
;
<xref rid="ref-42" ref-type="bibr">Wang et al., 2007</xref>
;
<xref rid="ref-41" ref-type="bibr">Wang et al., 2008</xref>
). For instance, if a document that contains the word “heart” was classified into “human body” category, this would not provide information to classify a document that contains the word “organ” and vice versa.</p>
</list-item>
<list-item>
<label></label>
<p>BoW representations often suffer from the problems of data sparseness (the zero-probability problem) and word usage diversity. This is because the BoW model only considers frequencies of words occurring in a class, and each document often contains only a small fraction of all words in the lexicon, which degrades the performance of the classifier (
<xref rid="ref-37" ref-type="bibr">Täckström, 2005</xref>
;
<xref rid="ref-38" ref-type="bibr">Tsao, Chen & Wang, 2013</xref>
).</p>
</list-item>
</list>
<p>The most relevant works found in the literature focus mainly on solving two of the aforementioned problems: synonymy and polysemy. To accomplish this, several authors have proposed a concept-based document representation, defining a concept as a “unit of meaning” (
<xref rid="ref-24" ref-type="bibr">Medelyan, Witten & Milne, 2008</xref>
;
<xref rid="ref-41" ref-type="bibr">Wang et al., 2008</xref>
;
<xref rid="ref-36" ref-type="bibr">Stock, 2010</xref>
). Several previous works demonstrated that this representation provides good results in classification tasks (
<xref rid="ref-31" ref-type="bibr">Sahlgren & Cöster, 2004</xref>
;
<xref rid="ref-41" ref-type="bibr">Wang et al., 2008</xref>
).</p>
<p>The literature hosts several ways to create this bag-of-concepts representation. In Latent Semantic Analysis (LSA) (
<xref rid="ref-7" ref-type="bibr">Deerwester et al., 1990</xref>
;
<xref rid="ref-20" ref-type="bibr">Landauer & Dumais, 1997</xref>
) a concept is a vector that represents the context in which a term occurs; this approach overcomes synonymy but not polysemy. In Latent Dirichlet Allocation (LDA) (
<xref rid="ref-2" ref-type="bibr">Blei, Ng & Jordan, 2003</xref>
) each concept consists of a bag-of-words that represents an underlying topic in the text. In Explicit Semantic Analysis (ESA) (
<xref rid="ref-10" ref-type="bibr">Gabrilovich & Markovitch, 2007</xref>
) concepts are entries from external knowledge bases such as Wikipedia, WordNet, or the Open Directory Project (ODP); these concepts are assigned to documents (the annotation process) in accordance with their overlap with each entry in the knowledge base; the main disadvantage of ESA is its tendency to generate outliers (
<xref rid="ref-8" ref-type="bibr">Egozi, Markovitch & Gabrilovich, 2011</xref>
)—concepts that have a weak relationship to the document to annotate. Semantic annotators—the approach used in our proposal—extract concepts, disambiguate them, link them to domain-specific external sources—such as Unified Medical Language System (UMLS) or Medical Subject Headings (MeSH)—or to general-purpose external sources—such as Wikipedia—and deal with synonymy and polysemy problems.</p>
<p>We think that there is a research gap in the application of BoC representations that leverage encyclopedic knowledge in the building of classifiers of biomedical literature. This article aims at bridging this gap by designing, developing, and evaluating a classifier—single-label and multi-label—of biomedical literature that builds on encyclopedic knowledge and represents documents as bags-of-concepts. In order to evaluate the system, we conducted several experiments with one of the most commonly used corpora for evaluating classification and retrieval of biomedical information—OHSUMED—as well as with a purpose-built corpus that comprises MEDLINE biomedical abstracts published in 2014—UVigoMED. Results obtained show a superior performance of the classifier when using the BoC representation, which is indicative of the potential of the proposed system to automatically classify scientific literature in the biomedical domain.</p>
<p>The remainder of this article is organised as follows: ‘Background’ presents some background knowledge; ‘Materials and Methods’ presents the corpora used, the algorithms, classification strategies, and metrics employed, and the approach proposed; ‘Results’ shows results obtained; ‘Discussion’ discusses the results obtained and presents proposals for future work; finally, ‘Conclusions’ presents the conclusions obtained.</p>
</sec>
<sec>
<title>Background</title>
<p>In order to create the BoC representation of documents we use a general purpose semantic annotator. The literature contains other proposals for the creation of representations as bags-of-concepts. In this section, we discuss the main proposals for creating representations of documents as bags-of-concepts—Latent Semantic Analysis, Latent Dirichlet Allocation, Explicit Semantic Analysis, domain-specific semantic annotators, general-purpose semantic annotators, and hybrid semantic annotators—and proposals for biomedical literature classification that make use of these representations.</p>
<sec>
<title>Latent Semantic Analysis</title>
<p>Underlying the Latent Semantic Analysis model is the distributional hypothesis (
<xref rid="ref-13" ref-type="bibr">Harris, 1968</xref>
;
<xref rid="ref-30" ref-type="bibr">Sahlgren, 2008</xref>
): words that appear in similar contexts have similar meanings (
<xref rid="ref-7" ref-type="bibr">Deerwester et al., 1990</xref>
;
<xref rid="ref-20" ref-type="bibr">Landauer & Dumais, 1997</xref>
). In LSA, the meaning of a word is represented as a vector of occurrences of that word in different contexts, a context being a text document. Although LSA combats the synonymy problem, it does not combat polysemy.</p>
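A minimal LSA sketch, assuming the truncated-SVD formulation available in Scikit-learn (the four toy documents are hypothetical):

    # LSA sketch: factor the term-document matrix into a latent "concept" space.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["heart attack symptoms", "myocardial infarction signs",
            "skin cancer screening", "melanoma early detection"]
    X = TfidfVectorizer().fit_transform(docs)          # words as features
    lsa = TruncatedSVD(n_components=2, random_state=0)
    X_lsa = lsa.fit_transform(X)                       # documents in latent space
    print(X_lsa.shape)                                 # (4, 2)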
<p>The LSA model has been used by several authors for biomedical literature classification tasks.
<xref rid="ref-19" ref-type="bibr">Kim, Howland & Park (2005)</xref>
explore the dimensionality reduction provided by LSA for classifying a subset of MEDLINE, reporting precision values reaching 90%.
<xref rid="ref-37" ref-type="bibr">Täckström (2005)</xref>
also makes use of the LSA model for the categorisation of a subset of MEDLINE, obtaining positive results using BoC in categories where BoW fails; although the results are positive, the author recommends using BoW as the primary representation mechanism and BoC as an occasional complement.</p>
</sec>
<sec>
<title>Latent Dirichlet Allocation</title>
<p>The Latent Dirichlet Allocation model (
<xref rid="ref-2" ref-type="bibr">Blei, Ng & Jordan, 2003</xref>
) presupposes that each document within a collection comprises a small number of topics, each one of them “generating” words. Thus, LDA automatically finds topics in a text, or in other words, LDA attempts “to go back” from the document and find the set of topics that may have generated it.
<xref rid="ref-46" ref-type="bibr">Zheng, McLean & Lu (2006)</xref>
make use of LDA to identify biological topics—i.e., concepts—from a corpus composed of biomedical articles that belong to MEDLINE; to that end, first, they use LDA to identify the most relevant concepts, and subsequently, these concepts are mapped to a biomedical vocabulary: Gene Ontology.
<xref rid="ref-27" ref-type="bibr">Phan, Nguyen & Horiguchi (2008)</xref>
obtain good results in the classification of short texts—OHSUMED abstracts—making use of a BoC document representation whose concepts were extracted using LDA.
<xref rid="ref-45" ref-type="bibr">Zhang, Phan & Horiguchi (2008)</xref>
focus on improving the performance of a classifier, making use of LDA to reduce the dimensionality of the set of features employed; the proposed method is applied to the biomedical corpus OHSUMED, obtaining results that demonstrate that the approach proposed provides better precision values, while reducing the size of the feature space.</p>
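A minimal LDA sketch, assuming Scikit-learn's implementation (toy corpus; with so little data the two inferred topics are not guaranteed to be clean):

    # LDA sketch: each document becomes a mixture over latent topics.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["tumour growth and cancer cells",
            "cancer therapy and tumour response",
            "heart rate and blood pressure",
            "blood flow in the heart"]
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)    # each row sums to 1: a topic mixture
    print(doc_topics.round(2))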
</sec>
<sec>
<title>Explicit Semantic Analysis</title>
<p>
<xref rid="ref-10" ref-type="bibr">Gabrilovich & Markovitch (2007)</xref>
propose Explicit Semantic Analysis, a technique that leverages external knowledge sources—such as Wikipedia or ODP—to generate features from text documents. Contrary to LSA and LDA, ESA performs textual analysis by identifying topics that are explicitly present in background knowledge bases—such as Wikipedia or ODP—instead of latent topics. In other words, ESA analyses a text to index it with Wikipedia concepts.
<xref rid="ref-11" ref-type="bibr">Gabrilovich & Markovitch (2009)</xref>
use ESA to extract features from a text and to classify text documents from MEDLINE into categories. The authors report improvements in classification performance when using ESA to generate document features.</p>
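A sketch of the ESA idea (not the authors' implementation): the weight of each concept is the overlap, here cosine similarity, between the text and the corresponding encyclopedia entry. The two-entry "encyclopedia" below is hypothetical:

    # ESA-style sketch: concept weight = similarity of text to the concept's entry.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    concepts = {"Neoplasm": "abnormal tissue growth tumour cancer malignant",
                "Heart": "muscular organ pumping blood through the circulatory system"}
    vec = TfidfVectorizer().fit(concepts.values())
    entries = vec.transform(concepts.values())

    text = vec.transform(["a malignant tumour is an abnormal growth"])
    weights = cosine_similarity(text, entries)[0]
    print(dict(zip(concepts, weights)))   # higher weight for "Neoplasm"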
</sec>
<sec>
<title>Semantic annotators</title>
<p>A semantic annotator is a software agent that is responsible for extracting the concepts that define a document, linking or mapping these concepts to entries from external sources. Semantic annotators usually perform disambiguation, thus combating synonymy and polysemy problems; and, in some cases, they assign a weight to each extracted concept in accordance with its semantic relevance within the document. Depending on the external source employed to link or map the extracted concepts, two kinds of semantic annotators can be distinguished: domain-specific semantic annotators and general-purpose semantic annotators.</p>
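The output of such an annotator can be pictured as a weighted bag-of-concepts per document; the sketch below (hypothetical concepts and weights, no real annotator invoked) turns that output into a feature matrix with Scikit-learn:

    # Hypothetical annotator output: documents as weighted Wikipedia concepts.
    from sklearn.feature_extraction import DictVectorizer

    doc1 = {"Neoplasm": 0.92, "Tissue (biology)": 0.71}    # weights reflect each
    doc2 = {"Neoplasm": 0.40, "Circulatory system": 0.88}  # concept's semantic
                                                           # relevance in the text
    X = DictVectorizer().fit_transform([doc1, doc2])       # BoC feature matrix
    print(X.toarray())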
<sec>
<title>Domain-specific semantic annotators</title>
<p>Domain-specific semantic annotators use external sources of a particular domain as knowledge bases to map extracted concepts. In the biomedical domain there are several biomedical ontologies, the most relevant in the state of the art being MeSH (
<xref rid="ref-23" ref-type="bibr">Lowe & Barnett, 1994</xref>
;
<xref rid="ref-22" ref-type="bibr">Lipscomb, 2000</xref>
) and UMLS (
<xref rid="ref-5" ref-type="bibr">Bodenreider, 2004</xref>
). We can find several domain-specific semantic annotators in the literature.
<xref rid="ref-9" ref-type="bibr">Elkin et al. (1988)</xref>
propose a tool to identify MeSH terms in narrative texts.
<xref rid="ref-1" ref-type="bibr">Aronson (2001)</xref>
describes the MetaMap program, which embeds an algorithm that allows for representing biomedical texts through UMLS concepts.
<xref rid="ref-17" ref-type="bibr">Jonquet, Shah & Musen (2009)</xref>
present Open Biomedical Annotator: first, it extracts terms from text documents making use of Mgrep (
<xref rid="ref-6" ref-type="bibr">Dai et al., 2008</xref>
); second, it maps these terms to biomedical concepts from UMLS and other biomedical ontologies from the National Center for Biomedical Ontology (NCBO); and, finally, it annotates the documents with these concepts.
<xref rid="ref-18" ref-type="bibr">Kang et al. (2012)</xref>
combine seven domain-specific annotators—ABNER, LingPipe, MetaMap, OpenNLP Chunker, JNET, Peregrine, and StanfordNER—to extract medical concepts from clinical texts, providing better results than any of the individual systems alone. Several authors make use of these and other semantic annotators for biomedical classification tasks:
<xref rid="ref-44" ref-type="bibr">Yetisgen-Yildiz & Pratt (2005)</xref>
, who use MetaMap to extract concepts from documents and use it to classify biomedical literature; and
<xref rid="ref-48" ref-type="bibr">Zhou, Zhang & Hu (2008a)</xref>
, who use a semantic annotator based on UMLS (MaxMatcher (
<xref rid="ref-47" ref-type="bibr">Zhou, Zhang & Hu, 2006</xref>
)) for the Bayesian classification of the biomedical literature corpus OHSUMED.</p>
</sec>
<sec>
<title>General-purpose semantic annotators</title>
<p>General-purpose semantic annotators use generic knowledge bases—not specific to a particular domain—such as Wikipedia, WordNet, or Freebase, instead of domain-specific ontologies.
<xref rid="ref-40" ref-type="bibr">Vivaldi & Rodríguez (2010)</xref>
present a system to extract concepts from biomedical text using Wikipedia as semantic information source, and
<xref rid="ref-15" ref-type="bibr">Huang & Milne (2012)</xref>
propose the use of a semantic annotator—using Wikipedia and WordNet as knowledge bases—for creating BoC representations from documents and their use in biomedical literature classification tasks.</p>
</sec>
<sec>
<title>Hybrid semantic annotators</title>
<p>Hybrid semantic annotators use domain-specific ontologies—such as UMLS or MeSH—and generic knowledge bases—such as WordNet—as background knowledge to extract concepts from a narrative text. Thus, they leverage the advantages of both approaches—the specificity provided by domain-specific ontologies and the generality provided by generic knowledge bases.
<xref rid="ref-4" ref-type="bibr">Bloehdorn & Hotho (2004)</xref>
use this technique to enrich BoW representations of texts with concepts extracted from the text itself, making use of the MeSH ontology and the lexical database WordNet. This enriched representation is then used to classify the biomedical literature corpus OHSUMED, reporting F1-values of 48%.</p>
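The enrichment idea can be sketched by simply concatenating word features and concept features (the concept annotations below are hypothetical):

    # Enriched representation sketch: BoW features plus concept features.
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction import DictVectorizer

    docs = ["tumour of the skin", "blood pressure and the heart"]
    concepts = [{"Neoplasm": 1.0}, {"Circulatory system": 1.0}]  # hypothetical

    X_words = CountVectorizer().fit_transform(docs)
    X_concepts = DictVectorizer().fit_transform(concepts)
    X_enriched = hstack([X_words, X_concepts])   # side-by-side feature blocks
    print(X_enriched.shape)                      # (2, 10)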
</sec>
</sec>
</sec>
<sec sec-type="materials|methods">
<title>Materials and Methods</title>
<sec>
<title>Dataset</title>
<sec>
<title>OHSUMED</title>
<p>In order to evaluate the proposed system, we conducted four experiments with OHSUMED, a well-known corpus for information retrieval and classification tasks. To carry out the experiments with the multi-label classifier, we used a subset of OHSUMED composed of 23,166 biomedical abstracts from 1991, classified into one or several of the 23 possible categories (
<xref rid="ref-16" ref-type="bibr">Joachims, 1998</xref>
). In order to create train and test sequences, we randomly split the corpus into a training sequence comprising 18,533 documents and a test sequence composed of the remaining 4,633 documents.</p>
<p>To perform the single-label experiments, we removed from the aforementioned corpus those documents belonging to more than one category, resulting in a corpus of 9,034 documents classified into only one of the 23 categories; we then randomised it again and split it into a training sequence of 7,227 documents and a test sequence of 1,807 documents.</p>
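The filtering-and-splitting procedure can be sketched as follows (toy corpus; the documents and category names are hypothetical, with the same ~80/20 split ratio):

    # Keep documents with exactly one category, then split ~80/20 at random.
    from sklearn.model_selection import train_test_split

    corpus = [("abstract on heart disease", ["cardiovascular"]),
              ("abstract on skin cancer", ["neoplasms"]),
              ("abstract on lung cancer and infection", ["neoplasms", "infections"]),
              ("abstract on arrhythmia", ["cardiovascular"]),
              ("abstract on melanoma", ["neoplasms"])]
    single = [(text, cats[0]) for text, cats in corpus if len(cats) == 1]
    texts, labels = zip(*single)
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=0)
    print(len(X_train), len(X_test))   # 3 1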
</sec>
<sec>
<title>UVigoMED</title>
<p>In order to corroborate the results obtained over the OHSUMED corpus, we purpose-built another corpus on which to conduct the same experiments. We named it UVigoMED.
<xref ref-type="fn" rid="fn-1">
<sup>1</sup>
</xref>
<fn id="fn-1">
<label>1</label>
<p>Corpus is available at
<uri xlink:href="http://www.itec-sde.net/UVigoMED.zip">http://www.itec-sde.net/UVigoMED.zip</uri>
.</p>
</fn>
In this section, we briefly describe the corpus and the process of collecting documents. First, we selected the classification scheme, consisting of the MeSH general terms of the “diseases” group—the same as in OHSUMED. It is worth noting that, to create the UVigoMED corpus, we used the 2015 MeSH tree structure, where the diseases group contains 26 categories instead of the 23 that the MeSH tree structure contained when OHSUMED was created. To build the corpus we performed the following steps (see
<xref ref-type="fig" rid="fig-1">Fig. 1</xref>
):</p>
<list list-type="simple" id="list-2">
<list-item>
<label></label>
<p>We downloaded from MEDLINE all the article descriptions (HTML webpages) of year 2014 classified under each of the 26 categories.</p>
</list-item>
<list-item>
<label></label>
<p>We extracted from each article description: the title, the abstract, and the categories it belongs to.</p>
</list-item>
<list-item>
<label></label>
<p>We stored in our database the title, abstract and categories for each article description that was downloaded.</p>
</list-item>
</list>
<p>As a result, we obtained a corpus that comprises 92,661 biomedical articles classified into one or several of the 26 available categories. Finally, in order to create the training and test sequences, we randomly selected 18,532 documents as the test sequence, leaving 74,129 for the training sequence.</p>
<p>To carry out the single-label experiments, we created a subset of the aforementioned corpus comprising those documents belonging to just one category—by removing those that belonged to more than one—resulting in a corpus of 54,853 documents classified into one of the 26 categories, split randomly into a training sequence of 43,882 documents and a test sequence of 10,971 items.</p>
</sec>
</sec>
<sec>
<title>Multi-label classification methods</title>
<p>There are two main approaches to the multi-label classification problem: problem transformation methods and algorithm adaptation methods. Problem transformation methods transform the multi-label problem into several single-label problems, whereas algorithm adaptation methods adapt specific algorithms to address multi-label problems directly, without performing any transformation.</p>
<p>In our proposal, we opted for the methods of the first category, i.e., transforming the multi-label problem into N single-label binary problems, one for each category. To perform this, we made use of
<italic>Scikit-learn</italic>
, a module for Python that provides a set of the most relevant machine learning algorithms in the state-of-the-art (
<xref rid="ref-26" ref-type="bibr">Pedregosa et al., 2012</xref>
). In particular, we made use of the
<italic>one-vs-rest</italic>
or
<italic>one-vs-all</italic>
strategy, which automatically implements a classifier for each category. This strategy also allows for using different classification algorithms, including SVM, which is the one we chose.</p>
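A minimal sketch of the one-vs-rest strategy with Scikit-learn; the strategy and LinearSVC come from the text above, while the toy documents and category names are hypothetical:

    # One-vs-rest: one binary LinearSVC per category.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    docs = ["heart failure and infection", "skin neoplasm",
            "cardiac infection", "melanoma of the skin"]
    labels = [["cardiovascular", "infections"], ["neoplasms"],
              ["cardiovascular", "infections"], ["neoplasms"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)             # one binary column per category
    X = TfidfVectorizer().fit_transform(docs)
    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    print(mlb.inverse_transform(clf.predict(X[:1])))  # e.g. [('cardiovascular', 'infections')]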
</sec>
<sec>
<title>SVM algorithm</title>
<p>Support Vector Machines are a set of supervised machine learning algorithms used in clustering, regression, and classification tasks, among others. We selected the SVM algorithm because it is one of the most relevant algorithms in the state-of-the-art—together with Naïve Bayes,
<italic>k</italic>
-Nearest Neighbor, Decision Trees, and Neural Networks—it is one of the most successful machine learning algorithms for automatic text classification tasks (
<xref rid="ref-29" ref-type="bibr">Rigutini, Maggini & Liu, 2005</xref>
) and it offers higher performance than other relevant state-of-the-art algorithms such as KNN or Naïve Bayes (
<xref rid="ref-43" ref-type="bibr">Yang, 1999</xref>
). Although a more detailed definition can be found in
<xref rid="ref-14" ref-type="bibr">Hearst et al. (1998)</xref>
, the basic idea is that, given a set of items belonging to a set of categories, SVM builds a model that can predict which category new items that appear in the system belong to. SVM represents each item as a point in space, separating the categories as much as possible. Then, when a new item appears in the model, it is placed in one category or another, depending on its proximity to each one. This algorithm corresponds to the class
<italic>sklearn.svm.LinearSVC</italic>
of the
<italic>Scikit-learn</italic>
library.</p>
</sec>
<sec>
<title>Evaluation metrics</title>
<p>The single-label and multi-label classification problems make use of different evaluation metrics. Hereafter, we cite the main metrics used in the literature to evaluate each of the two problems.</p>
<sec>
<title>Single-label classification problem</title>
<p>When predicting the category to which a document belongs, there are four possible outcomes: true positive (TP), true negative (TN), false positive (FP) and false negative (FN), where
<italic>positive</italic>
means that a document was classified in a certain category,
<italic>negative</italic>
means the opposite,
<italic>true</italic>
means that the classification was correct and
<italic>false</italic>
means that the classification was incorrect (
<xref rid="ref-31" ref-type="bibr">Sahlgren & Cöster, 2004</xref>
).</p>
<p>In the same way as
<xref rid="ref-34" ref-type="bibr">Sebastiani (2002)</xref>
and
<xref rid="ref-31" ref-type="bibr">Sahlgren & Cöster (2004)</xref>
, we define:
<disp-formula id="eqn-1">
<label>(1)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e001.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} P=\mathrm{Precision}=\frac{T P}{(T P+F P)} \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-1">
<mml:mstyle displaystyle="true">
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Precision</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced separators="" open="(" close=")">
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
<disp-formula id="eqn-2">
<label>(2)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e002.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M2">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} R=\mathrm{Recall}=\frac{T P}{(T P+F N)}. \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-2">
<mml:mstyle displaystyle="true">
<mml:mi>R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Recall</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced separators="" open="(" close=")">
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
</p>
<p>We also use a measure that combines precision and recall, F1-score, defined as:
<disp-formula id="eqn-3">
<label>(3)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e003.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} {F}_{1}=\frac{2\ast P\ast R}{P+R}. \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-3">
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>∗</mml:mo>
<mml:mi>P</mml:mi>
<mml:mo>∗</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
</p>
<p>In our work we report the results as macro-F1, because it is the best metric to reflect the classification performance in corpora where data are not evenly distributed over different categories (
<xref rid="ref-48" ref-type="bibr">Zhou, Zhang & Hu, 2008a</xref>
).</p>
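Macro-F1 applies Eqs. (1)-(3) per category and averages the per-category F1 values without weighting; a small sketch with Scikit-learn (toy predictions, category names hypothetical):

    # Macro-F1: compute F1 per category, then take the unweighted mean.
    from sklearn.metrics import f1_score

    y_true = ["cardio", "cardio", "neoplasm", "infection"]
    y_pred = ["cardio", "neoplasm", "neoplasm", "infection"]
    print(f1_score(y_true, y_pred, average="macro"))   # (2/3 + 1 + 2/3)/3 = 0.777...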
</sec>
<sec>
<title>Multi-label classification problem</title>
<p>
<xref rid="ref-33" ref-type="bibr">Schapire & Singer (2000)</xref>
consider in their work the
<italic>Hamming Loss</italic>
, defined according to
<xref rid="ref-39" ref-type="bibr">Tsoumakas & Katakis (2007)</xref>
as
<disp-formula id="eqn-4">
<label>(4)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e004.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M4">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} H L=\text{Hamming Loss}(H,D)=\frac{1}{\vert D\vert }\sum _{i=1}^{\vert D\vert }\frac{\vert {Y}_{i}\Delta {Z}_{i}\vert }{\vert L\vert } \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-4">
<mml:mstyle displaystyle="true">
<mml:mi>H</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo>=</mml:mo>
<mml:mtext>Hamming Loss</mml:mtext>
<mml:mfenced separators="" open="(" close=")">
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>D</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>Δ</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>L</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
where
<italic>D</italic>
is the multi-label corpus, which comprises |
<italic>D</italic>
| multi-label elements (
<italic>x</italic>
<sub>
<italic>i</italic>
,</sub>
<italic>Y
<sub>i</sub>
</italic>
),
<italic>i</italic>
= 1...|
<italic>D</italic>
|,
<italic>Y
<sub>i</sub>
</italic>
⊆
<italic>L</italic>
,
<italic>L</italic>
is the set of labels, composed of |
<italic>L</italic>
| labels,
<italic>H</italic>
is a multi-label classifier,
<italic>Z</italic>
=
<italic>H</italic>
(
<italic>x
<sub>i</sub>
</italic>
) is the set of labels predicted by
<italic>H</italic>
for
<italic>x
<sub>i</sub>
</italic>
, and Δ represents the symmetric difference between two sets, corresponding to the XOR operation in Boolean algebra.</p>
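<p>As a minimal sketch of Eq. (4), assuming label sets represented as Python sets, the symmetric difference between true and predicted label sets is averaged over the corpus and normalised by the number of labels:</p>
<preformat>
def hamming_loss(Y, Z, num_labels):
    """Eq. (4): mean size of the symmetric difference (XOR) between true
    label sets Y[i] and predicted label sets Z[i], normalised by |L|."""
    total = sum(len(y ^ z) for y, z in zip(Y, Z))
    return total / (len(Y) * num_labels)

# Example: |D| = 2 documents, |L| = 4 labels
Y = [{0, 1}, {2}]   # true label sets
Z = [{1}, {2, 3}]   # predicted label sets
print(hamming_loss(Y, Z, num_labels=4))  # (1 + 1) / (2 * 4) = 0.25
</preformat>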
<p>The following metrics—
<italic>Accuracy</italic>
,
<italic>Precision</italic>
, and
<italic>Recall</italic>
—are used by
<xref rid="ref-12" ref-type="bibr">Godbole & Sarawagi (2004)</xref>
and defined again by
<xref rid="ref-39" ref-type="bibr">Tsoumakas & Katakis (2007)</xref>
as:
<disp-formula id="eqn-5">
<label>(5)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e005.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} A=\mathrm{Accuracy}(H,D)=\frac{1}{\vert D\vert }\sum _{i=1}^{\vert D\vert }\frac{\vert {Y}_{i}\cap {Z}_{i}\vert }{\vert {Y}_{i}\cup {Z}_{i}\vert } \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-5">
<mml:mstyle displaystyle="true">
<mml:mi>A</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Accuracy</mml:mi>
<mml:mfenced separators="" open="(" close=")">
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>D</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∩</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
<disp-formula id="eqn-6">
<label>(6)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e006.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M6">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} P=\mathrm{Precision}(H,D)=\frac{1}{\vert D\vert }\sum _{i=1}^{\vert D\vert }\frac{\vert {Y}_{i}\cap {Z}_{i}\vert }{\vert {Z}_{i}\vert } \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-6">
<mml:mstyle displaystyle="true">
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Precision</mml:mi>
<mml:mfenced separators="" open="(" close=")">
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>D</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∩</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
<disp-formula id="eqn-7">
<label>(7)</label>
<alternatives>
<graphic xlink:href="peerj-03-1279-e007.jpg" mimetype="image" mime-subtype="png" position="float" orientation="portrait"></graphic>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{eqnarray*} R=\mathrm{Recall}(H,D)=\frac{1}{\vert D\vert }\sum _{i=1}^{\vert D\vert }\frac{\vert {Y}_{i}\cap {Z}_{i}\vert }{\vert {Y}_{i}\vert }. \end{eqnarray*}\end{document}</tex-math>
<mml:math id="mml-eqn-7">
<mml:mstyle displaystyle="true">
<mml:mi>R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Recall</mml:mi>
<mml:mfenced separators="" open="(" close=")">
<mml:mi>H</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>D</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>D</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>∩</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mstyle>
</mml:math>
</alternatives>
</disp-formula>
</p>
<p>We also use the F1-score, defined in the previous section.</p>
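<p>As a minimal sketch, assuming again that true and predicted labels are given as Python sets, Eqs. (5)–(7) can be computed directly. True label sets are assumed non-empty, and empty predictions contribute zero to precision:</p>
<preformat>
def multilabel_scores(Y, Z):
    """Eqs. (5)-(7): example-based accuracy, precision and recall,
    averaged over the |D| elements of the multi-label corpus."""
    n = len(Y)
    accuracy = sum(len(y & z) / len(y | z) for y, z in zip(Y, Z)) / n
    # documents with an empty prediction set contribute zero to precision
    precision = sum(len(y & z) / len(z) for y, z in zip(Y, Z) if z) / n
    recall = sum(len(y & z) / len(y) for y, z in zip(Y, Z)) / n
    return accuracy, precision, recall

Y = [{0, 1}, {2}]   # true label sets
Z = [{1}, {2, 3}]   # predicted label sets
print(multilabel_scores(Y, Z))  # (0.5, 0.75, 0.75)
</preformat>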
</sec>
</sec>
<sec>
<title>Approach</title>
<p>The approach presented consists of the classification—single-label and multi-label—of the two corpora of biomedical literature defined in ‘Dataset’ using a Wikipedia-based bag-of-concepts representation of documents, and the comparison of its performance with that of the classifier when using the traditional BoW representation. We used the SVM algorithm (‘SVM algorithm’) and, for the multi-label problem, we also made use of the strategy presented in ‘Multi-label classification methods’. To conduct all the experiments under the same conditions, we randomly selected for both corpora—single-label and multi-label versions—training sequences composed of 5,000 elements and test sequences comprising 1,000 elements.</p>
<p>First, it was necessary to obtain the BoW and BoC representations of each document in the corpora.
<xref ref-type="fig" rid="fig-2">Figure 2</xref>
shows the differences between the creation of the traditional BoW representation and that of the BoC representation. To create the BoW representation of a document, the first step is to filter out stop words. Stop words are words such as "the", "if", and "or" that are of no use for text classification, since they occur in almost all documents. The next step is stemming, the removal of common inflectional affixes, which performs a form of morphological normalization to create more general features. To that end we use the
<italic>Porter stemmer</italic>
(
<xref rid="ref-28" ref-type="bibr">Porter, 1980</xref>
), which is the most common stemming algorithm to work with English text (
<xref rid="ref-37" ref-type="bibr">Täckström, 2005</xref>
). Finally, we calculate the frequency of occurrence of stemmed words.</p>
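<p>A minimal sketch of this BoW pipeline, assuming NLTK for the stop-word list, tokeniser, and Porter stemmer (the actual implementation used in the experiments may differ):</p>
<preformat>
from collections import Counter

from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

def bag_of_words(text):
    """Filter stop words, stem with the Porter algorithm, and count
    the frequency of occurrence of each stemmed word."""
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    return Counter(stemmer.stem(t) for t in tokens if t not in stop)
</preformat>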
<fig id="fig-1" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-1</object-id>
<label>Figure 1</label>
<caption>
<title>UVigoMED corpus creation.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g001"></graphic>
</fig>
<fig id="fig-2" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-2</object-id>
<label>Figure 2</label>
<caption>
<title>Bag-of-words and bag-of-concepts creation process.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g002"></graphic>
</fig>
<p>To create the BoC, we opted to use a semantic annotator, in particular a general-purpose semantic annotator that uses NLP techniques, machine learning, and Wikipedia as a knowledge base:
<italic>Wikipedia Miner</italic>
(
<xref rid="ref-25" ref-type="bibr">Milne & Witten, 2013</xref>
). The implementation of its algorithm is based on three steps:</p>
<list list-type="simple" id="list-3">
<list-item>
<label></label>
<p>The first step is
<italic>candidate selection</italic>
. Given a text document composed of a set of
<italic>n-grams</italic>
—an
<italic>n-gram</italic>
being a contiguous sequence of n words—the algorithm queries a vocabulary that comprises all the
<italic>anchor texts</italic>
in Wikipedia and verifies whether any of the
<italic>n-grams</italic>
are present in the vocabulary. Thus, each matching
<italic>n-gram-anchor text</italic>
pair yields a candidate, the most relevant candidates being those most frequently used as
<italic>anchor texts</italic>
in Wikipedia.</p>
</list-item>
<list-item>
<label></label>
<p>The second step is
<italic>disambiguation</italic>
. Given the same vocabulary of
<italic>anchor texts</italic>
, the algorithm selects the most suitable target for each candidate. This process uses machine learning techniques, with Wikipedia articles as the training sequence, since they contain good examples of manually performed disambiguation. Disambiguation takes into account the relationship of each candidate with other non-ambiguous terms in its context, as well as the commonness of the candidate.</p>
</list-item>
<list-item>
<label></label>
<p>The third step is
<italic>link detection</italic>
, wherein the relevance of the concepts extracted from the text is calculated. To that end, the algorithm again uses machine learning techniques and Wikipedia articles as the training sequence, since each article is a good example of what constitutes a relevant link and what does not.
<xref ref-type="fig" rid="fig-3">Figure 3</xref>
graphically shows the whole process of obtaining a bag-of-concepts—each concept being a Wikipedia article—from a text document.</p>
</list-item>
</list>
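<p>Schematically, the output of the three steps above can be turned into a BoC representation as follows. The <italic>annotate</italic> function below is a hypothetical wrapper around a semantic annotator such as Wikipedia Miner, returning (Wikipedia article, relevance) pairs; it does not reproduce the toolkit's actual API:</p>
<preformat>
def annotate(text):
    """Hypothetical wrapper around a Wikipedia-based semantic annotator:
    candidate selection, disambiguation and link detection, yielding
    (wikipedia_article, relevance) pairs for the detected concepts."""
    raise NotImplementedError  # stands in for a call to the annotator

def bag_of_concepts(text):
    """Represent a document as Wikipedia articles weighted by the semantic
    relevance computed in the link-detection step (not by raw frequency)."""
    return {article: relevance for article, relevance in annotate(text)}
</preformat>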
<p>Having obtained the BoC representation for each of the documents, we proceeded to classify the two corpora—both single-label and multi-label versions—making use of the strategies and algorithms defined in ‘Multi-label classification methods’ and ‘SVM algorithm’.</p>
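<p>A minimal sketch of this classification step, assuming scikit-learn: the BoC dictionaries are vectorised and, for the multi-label case, one binary SVM is trained per category following the one-vs-rest strategy. The corpus variables and category codes below are toy placeholders:</p>
<preformat>
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy inputs: lists of {concept: relevance} dictionaries and label lists
train_bocs = [{"Asthma": 0.9, "Lung": 0.4}, {"Diabetes": 0.8}]
train_labels = [["C08"], ["C18", "C19"]]
test_bocs = [{"Asthma": 0.7}]

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_bocs)
X_test = vectorizer.transform(test_bocs)

# Multi-label case: one binary SVM per category (one-vs-rest)
binarizer = MultiLabelBinarizer()
Y_train = binarizer.fit_transform(train_labels)
clf = OneVsRestClassifier(LinearSVC()).fit(X_train, Y_train)
print(binarizer.inverse_transform(clf.predict(X_test)))
</preformat>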
<fig id="fig-3" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-3</object-id>
<label>Figure 3</label>
<caption>
<title>Bag-of-concepts obtainment process of a document using Wikipedia Miner.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g003"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<sec>
<title>OHSUMED</title>
<p>
<xref ref-type="fig" rid="fig-4">Figure 4</xref>
and
<xref ref-type="table" rid="table-1">Table 1</xref>
show the evolution of the F1-score for BoW and BoC, varying the length of the training sequence for the single-labelled OHSUMED corpus; and
<xref ref-type="fig" rid="fig-5">Fig. 5</xref>
and
<xref ref-type="table" rid="table-2">Table 2</xref>
show the F1-score for BoW and BoC, varying the length of the training sequence in the multi-labelled OHSUMED corpus. For both experiments—single-label and multi-label—the performance offered by the classifier using the BoC representation is clearly superior to that offered by the traditional BoW representation. As we can see in
<xref ref-type="fig" rid="fig-8">Fig. 8</xref>
, the BoC representation achieves improvements of up to 157% for the single-label problem and up to 100% for the multi-label problem.</p>
<fig id="fig-4" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-4</object-id>
<label>Figure 4</label>
<caption>
<title>F1 score for BoW and BoC varying the length of the training sequence in single-labelled OHSUMED corpus.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g004"></graphic>
</fig>
<fig id="fig-5" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-5</object-id>
<label>Figure 5</label>
<caption>
<title>F1-score for BoW and BoC varying the length of the training sequence in multi-labelled OHSUMED corpus.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g005"></graphic>
</fig>
<table-wrap id="table-1" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/table-1</object-id>
<label>Table 1</label>
<caption>
<title>F1-score for BoW and BoC varying the length of the training sequence in single-labelled OHSUMED corpus.</title>
</caption>
<alternatives>
<graphic xlink:href="peerj-03-1279-g009"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">5</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">20</th>
<th rowspan="1" colspan="1">50</th>
<th rowspan="1" colspan="1">100</th>
<th rowspan="1" colspan="1">200</th>
<th rowspan="1" colspan="1">500</th>
<th rowspan="1" colspan="1">1,000</th>
<th rowspan="1" colspan="1">2,000</th>
<th rowspan="1" colspan="1">5,000</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" colspan="1">BoW</td>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.058</td>
<td rowspan="1" colspan="1">0.080</td>
<td rowspan="1" colspan="1">0.102</td>
<td rowspan="1" colspan="1">0.160</td>
<td rowspan="1" colspan="1">0.218</td>
<td rowspan="1" colspan="1">0.276</td>
<td rowspan="1" colspan="1">0.345</td>
<td rowspan="1" colspan="1">0.418</td>
<td rowspan="1" colspan="1">0.467</td>
<td rowspan="1" colspan="1">0.528</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.129</td>
<td rowspan="1" colspan="1">0.074</td>
<td rowspan="1" colspan="1">0.106</td>
<td rowspan="1" colspan="1">0.163</td>
<td rowspan="1" colspan="1">0.248</td>
<td rowspan="1" colspan="1">0.307</td>
<td rowspan="1" colspan="1">0.377</td>
<td rowspan="1" colspan="1">0.426</td>
<td rowspan="1" colspan="1">0.471</td>
<td rowspan="1" colspan="1">0.519</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.027</td>
<td rowspan="1" colspan="1">0.021</td>
<td rowspan="1" colspan="1">0.030</td>
<td rowspan="1" colspan="1">0.050</td>
<td rowspan="1" colspan="1">0.101</td>
<td rowspan="1" colspan="1">0.125</td>
<td rowspan="1" colspan="1">0.189</td>
<td rowspan="1" colspan="1">0.254</td>
<td rowspan="1" colspan="1">0.320</td>
<td rowspan="1" colspan="1">0.425</td>
</tr>
<tr>
<td rowspan="3" colspan="1">BoC</td>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.089</td>
<td rowspan="1" colspan="1">0.134</td>
<td rowspan="1" colspan="1">0.213</td>
<td rowspan="1" colspan="1">0.281</td>
<td rowspan="1" colspan="1">0.308</td>
<td rowspan="1" colspan="1">0.332</td>
<td rowspan="1" colspan="1">0.421</td>
<td rowspan="1" colspan="1">0.460</td>
<td rowspan="1" colspan="1">0.512</td>
<td rowspan="1" colspan="1">0.535</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.078</td>
<td rowspan="1" colspan="1">0.151</td>
<td rowspan="1" colspan="1">0.173</td>
<td rowspan="1" colspan="1">0.237</td>
<td rowspan="1" colspan="1">0.309</td>
<td rowspan="1" colspan="1">0.355</td>
<td rowspan="1" colspan="1">0.421</td>
<td rowspan="1" colspan="1">0.470</td>
<td rowspan="1" colspan="1">0.502</td>
<td rowspan="1" colspan="1">0.535</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.029</td>
<td rowspan="1" colspan="1">0.054</td>
<td rowspan="1" colspan="1">0.067</td>
<td rowspan="1" colspan="1">0.118</td>
<td rowspan="1" colspan="1">0.181</td>
<td rowspan="1" colspan="1">0.204</td>
<td rowspan="1" colspan="1">0.277</td>
<td rowspan="1" colspan="1">0.317</td>
<td rowspan="1" colspan="1">0.397</td>
<td rowspan="1" colspan="1">0.442</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="table-2" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/table-2</object-id>
<label>Table 2</label>
<caption>
<title>Hamming loss, precision, accuracy, recall and F1-score for BoW and BoC varying the length of the training sequence in multi-labelled OHSUMED corpus.</title>
</caption>
<alternatives>
<graphic xlink:href="peerj-03-1279-g010"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">5</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">20</th>
<th rowspan="1" colspan="1">50</th>
<th rowspan="1" colspan="1">100</th>
<th rowspan="1" colspan="1">200</th>
<th rowspan="1" colspan="1">500</th>
<th rowspan="1" colspan="1">1,000</th>
<th rowspan="1" colspan="1">2,000</th>
<th rowspan="1" colspan="1">5,000</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5" colspan="1">BoW</td>
<td rowspan="1" colspan="1">HL</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.063</td>
<td rowspan="1" colspan="1">0.063</td>
<td rowspan="1" colspan="1">0.062</td>
<td rowspan="1" colspan="1">0.062</td>
<td rowspan="1" colspan="1">0.059</td>
<td rowspan="1" colspan="1">0.058</td>
<td rowspan="1" colspan="1">0.056</td>
<td rowspan="1" colspan="1">0.054</td>
</tr>
<tr>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.180</td>
<td rowspan="1" colspan="1">0.063</td>
<td rowspan="1" colspan="1">0.147</td>
<td rowspan="1" colspan="1">0.300</td>
<td rowspan="1" colspan="1">0.380</td>
<td rowspan="1" colspan="1">0.461</td>
<td rowspan="1" colspan="1">0.507</td>
<td rowspan="1" colspan="1">0.532</td>
<td rowspan="1" colspan="1">0.560</td>
<td rowspan="1" colspan="1">0.571</td>
</tr>
<tr>
<td rowspan="1" colspan="1">A</td>
<td rowspan="1" colspan="1">0.000</td>
<td rowspan="1" colspan="1">0.000</td>
<td rowspan="1" colspan="1">0.016</td>
<td rowspan="1" colspan="1">0.026</td>
<td rowspan="1" colspan="1">0.030</td>
<td rowspan="1" colspan="1">0.049</td>
<td rowspan="1" colspan="1">0.121</td>
<td rowspan="1" colspan="1">0.156</td>
<td rowspan="1" colspan="1">0.198</td>
<td rowspan="1" colspan="1">0.192</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.002</td>
<td rowspan="1" colspan="1">0.031</td>
<td rowspan="1" colspan="1">0.047</td>
<td rowspan="1" colspan="1">0.056</td>
<td rowspan="1" colspan="1">0.082</td>
<td rowspan="1" colspan="1">0.200</td>
<td rowspan="1" colspan="1">0.288</td>
<td rowspan="1" colspan="1">0.385</td>
<td rowspan="1" colspan="1">0.482</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.002</td>
<td rowspan="1" colspan="1">0.019</td>
<td rowspan="1" colspan="1">0.028</td>
<td rowspan="1" colspan="1">0.041</td>
<td rowspan="1" colspan="1">0.080</td>
<td rowspan="1" colspan="1">0.172</td>
<td rowspan="1" colspan="1">0.244</td>
<td rowspan="1" colspan="1">0.343</td>
<td rowspan="1" colspan="1">0.424</td>
</tr>
<tr>
<td rowspan="5" colspan="1">BoC</td>
<td rowspan="1" colspan="1">HL</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.060</td>
<td rowspan="1" colspan="1">0.060</td>
<td rowspan="1" colspan="1">0.059</td>
<td rowspan="1" colspan="1">0.058</td>
<td rowspan="1" colspan="1">0.057</td>
<td rowspan="1" colspan="1">0.057</td>
<td rowspan="1" colspan="1">0.056</td>
<td rowspan="1" colspan="1">0.051</td>
</tr>
<tr>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.021</td>
<td rowspan="1" colspan="1">0.063</td>
<td rowspan="1" colspan="1">0.300</td>
<td rowspan="1" colspan="1">0.415</td>
<td rowspan="1" colspan="1">0.457</td>
<td rowspan="1" colspan="1">0.526</td>
<td rowspan="1" colspan="1">0.543</td>
<td rowspan="1" colspan="1">0.553</td>
<td rowspan="1" colspan="1">0.556</td>
<td rowspan="1" colspan="1">0.591</td>
</tr>
<tr>
<td rowspan="1" colspan="1">A</td>
<td rowspan="1" colspan="1">0.000</td>
<td rowspan="1" colspan="1">0.000</td>
<td rowspan="1" colspan="1">.0033</td>
<td rowspan="1" colspan="1">0.060</td>
<td rowspan="1" colspan="1">0.077</td>
<td rowspan="1" colspan="1">0.111</td>
<td rowspan="1" colspan="1">0.148</td>
<td rowspan="1" colspan="1">0.171</td>
<td rowspan="1" colspan="1">0.184</td>
<td rowspan="1" colspan="1">0.202</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.054</td>
<td rowspan="1" colspan="1">0.085</td>
<td rowspan="1" colspan="1">0.121</td>
<td rowspan="1" colspan="1">0.182</td>
<td rowspan="1" colspan="1">0.273</td>
<td rowspan="1" colspan="1">0.340</td>
<td rowspan="1" colspan="1">0.404</td>
<td rowspan="1" colspan="1">0.481</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.038</td>
<td rowspan="1" colspan="1">0.042</td>
<td rowspan="1" colspan="1">0.080</td>
<td rowspan="1" colspan="1">0.156</td>
<td rowspan="1" colspan="1">0.238</td>
<td rowspan="1" colspan="1">0.289</td>
<td rowspan="1" colspan="1">0.364</td>
<td rowspan="1" colspan="1">0.438</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
</sec>
<sec>
<title>UVigoMED</title>
<p>
<xref ref-type="fig" rid="fig-6">Figure 6</xref>
and
<xref ref-type="table" rid="table-3">Table 3</xref>
show the evolution of the F1-score for BoW and BoC when varying the length of the training sequence for the single-label version of the UVigoMED corpus. We can see that the performance offered by the classifier when using the BoC representation is much higher than that offered when using the BoW one, reaching improvements of up to 122%, as shown in
<xref ref-type="fig" rid="fig-8">Fig. 8</xref>
. The experiments conducted with the multi-label corpus provide the results shown in
<xref ref-type="fig" rid="fig-7">Fig. 7</xref>
and
<xref ref-type="table" rid="table-4">Table 4</xref>
, where we can see again that the BoC representation outperforms BoW, achieving increases of up to 155%.</p>
<fig id="fig-6" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-6</object-id>
<label>Figure 6</label>
<caption>
<title>F1 score for BoW and BoC varying the length of the training sequence in single-labelled UVigoMED corpus.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g006"></graphic>
</fig>
<fig id="fig-7" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-7</object-id>
<label>Figure 7</label>
<caption>
<title>F1-score for BoW and BoC varying the length of the training sequence in multi-labelled UVigoMED corpus.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g007"></graphic>
</fig>
<fig id="fig-8" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/fig-8</object-id>
<label>Figure 8</label>
<caption>
<title>F1-score percentage improvement for single-labelled OHSUMED, multi-labelled OHSUMED, single-labelled UVigoMED and multi-labelled UVigoMED according to training sequence length variation.</title>
</caption>
<graphic xlink:href="peerj-03-1279-g008"></graphic>
</fig>
<table-wrap id="table-3" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/table-3</object-id>
<label>Table 3</label>
<caption>
<title>F1-score for BoW and BoC varying the length of the training sequence in single-labelled UVigoMED corpus.</title>
</caption>
<alternatives>
<graphic xlink:href="peerj-03-1279-g011"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">5</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">20</th>
<th rowspan="1" colspan="1">50</th>
<th rowspan="1" colspan="1">100</th>
<th rowspan="1" colspan="1">200</th>
<th rowspan="1" colspan="1">500</th>
<th rowspan="1" colspan="1">1,000</th>
<th rowspan="1" colspan="1">2,000</th>
<th rowspan="1" colspan="1">5,000</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" colspan="1">BoW</td>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.059</td>
<td rowspan="1" colspan="1">0.122</td>
<td rowspan="1" colspan="1">0.102</td>
<td rowspan="1" colspan="1">0.116</td>
<td rowspan="1" colspan="1">0.179</td>
<td rowspan="1" colspan="1">0.276</td>
<td rowspan="1" colspan="1">0.377</td>
<td rowspan="1" colspan="1">0.460</td>
<td rowspan="1" colspan="1">0.518</td>
<td rowspan="1" colspan="1">0.629</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.060</td>
<td rowspan="1" colspan="1">0.074</td>
<td rowspan="1" colspan="1">0.097</td>
<td rowspan="1" colspan="1">0.150</td>
<td rowspan="1" colspan="1">0.183</td>
<td rowspan="1" colspan="1">0.272</td>
<td rowspan="1" colspan="1">0.397</td>
<td rowspan="1" colspan="1">0.457</td>
<td rowspan="1" colspan="1">0.511</td>
<td rowspan="1" colspan="1">0.631</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.026</td>
<td rowspan="1" colspan="1">0.027</td>
<td rowspan="1" colspan="1">0.035</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.084</td>
<td rowspan="1" colspan="1">0.136</td>
<td rowspan="1" colspan="1">0.220</td>
<td rowspan="1" colspan="1">0.283</td>
<td rowspan="1" colspan="1">0.360</td>
<td rowspan="1" colspan="1">0.421</td>
</tr>
<tr>
<td rowspan="3" colspan="1">BoC</td>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.095</td>
<td rowspan="1" colspan="1">0.222</td>
<td rowspan="1" colspan="1">0.284</td>
<td rowspan="1" colspan="1">0.259</td>
<td rowspan="1" colspan="1">0.308</td>
<td rowspan="1" colspan="1">0.436</td>
<td rowspan="1" colspan="1">0.500</td>
<td rowspan="1" colspan="1">0.544</td>
<td rowspan="1" colspan="1">0.586</td>
<td rowspan="1" colspan="1">0.594</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.049</td>
<td rowspan="1" colspan="1">0.093</td>
<td rowspan="1" colspan="1">0.148</td>
<td rowspan="1" colspan="1">0.247</td>
<td rowspan="1" colspan="1">0.321</td>
<td rowspan="1" colspan="1">0.432</td>
<td rowspan="1" colspan="1">0.515</td>
<td rowspan="1" colspan="1">0.557</td>
<td rowspan="1" colspan="1">0.590</td>
<td rowspan="1" colspan="1">0.598</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.017</td>
<td rowspan="1" colspan="1">0.040</td>
<td rowspan="1" colspan="1">0.078</td>
<td rowspan="1" colspan="1">0.116</td>
<td rowspan="1" colspan="1">0.179</td>
<td rowspan="1" colspan="1">0.269</td>
<td rowspan="1" colspan="1">0.331</td>
<td rowspan="1" colspan="1">0.390</td>
<td rowspan="1" colspan="1">0.430</td>
<td rowspan="1" colspan="1">0.467</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="table-4" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.7717/peerj.1279/table-4</object-id>
<label>Table 4</label>
<caption>
<title>Hamming loss, precision, accuracy, recall and F1-score for BoW and BoC varying the length of the training sequence in multi-labelled UVigoMED corpus.</title>
</caption>
<alternatives>
<graphic xlink:href="peerj-03-1279-g012"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
<col span="1"></col>
</colgroup>
<thead>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">5</th>
<th rowspan="1" colspan="1">10</th>
<th rowspan="1" colspan="1">20</th>
<th rowspan="1" colspan="1">50</th>
<th rowspan="1" colspan="1">100</th>
<th rowspan="1" colspan="1">200</th>
<th rowspan="1" colspan="1">500</th>
<th rowspan="1" colspan="1">1,000</th>
<th rowspan="1" colspan="1">2,000</th>
<th rowspan="1" colspan="1">5,000</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5" colspan="1">BoW</td>
<td rowspan="1" colspan="1">HL</td>
<td rowspan="1" colspan="1">0.087</td>
<td rowspan="1" colspan="1">0.090</td>
<td rowspan="1" colspan="1">0.068</td>
<td rowspan="1" colspan="1">0.064</td>
<td rowspan="1" colspan="1">0.065</td>
<td rowspan="1" colspan="1">0.065</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.059</td>
<td rowspan="1" colspan="1">0.056</td>
<td rowspan="1" colspan="1">0.055</td>
</tr>
<tr>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.069</td>
<td rowspan="1" colspan="1">0.033</td>
<td rowspan="1" colspan="1">0.114</td>
<td rowspan="1" colspan="1">0.107</td>
<td rowspan="1" colspan="1">0.160</td>
<td rowspan="1" colspan="1">0.328</td>
<td rowspan="1" colspan="1">0.489</td>
<td rowspan="1" colspan="1">0.544</td>
<td rowspan="1" colspan="1">0.573</td>
<td rowspan="1" colspan="1">0.568</td>
</tr>
<tr>
<td rowspan="1" colspan="1">A</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.010</td>
<td rowspan="1" colspan="1">0.090</td>
<td rowspan="1" colspan="1">0.011</td>
<td rowspan="1" colspan="1">0.019</td>
<td rowspan="1" colspan="1">0.034</td>
<td rowspan="1" colspan="1">0.083</td>
<td rowspan="1" colspan="1">0.136</td>
<td rowspan="1" colspan="1">0.182</td>
<td rowspan="1" colspan="1">0.207</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.024</td>
<td rowspan="1" colspan="1">0.031</td>
<td rowspan="1" colspan="1">0.015</td>
<td rowspan="1" colspan="1">0.016</td>
<td rowspan="1" colspan="1">0.029</td>
<td rowspan="1" colspan="1">0.062</td>
<td rowspan="1" colspan="1">0.152</td>
<td rowspan="1" colspan="1">0.229</td>
<td rowspan="1" colspan="1">0.312</td>
<td rowspan="1" colspan="1">0.414</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.010</td>
<td rowspan="1" colspan="1">0.013</td>
<td rowspan="1" colspan="1">0.014</td>
<td rowspan="1" colspan="1">0.020</td>
<td rowspan="1" colspan="1">0.029</td>
<td rowspan="1" colspan="1">0.064</td>
<td rowspan="1" colspan="1">0.150</td>
<td rowspan="1" colspan="1">0.223</td>
<td rowspan="1" colspan="1">0.311</td>
<td rowspan="1" colspan="1">0.386</td>
</tr>
<tr>
<td rowspan="5" colspan="1">BoC</td>
<td rowspan="1" colspan="1">HL</td>
<td rowspan="1" colspan="1">0.086</td>
<td rowspan="1" colspan="1">0.071</td>
<td rowspan="1" colspan="1">0.062</td>
<td rowspan="1" colspan="1">0.061</td>
<td rowspan="1" colspan="1">0.060</td>
<td rowspan="1" colspan="1">0.060</td>
<td rowspan="1" colspan="1">0.056</td>
<td rowspan="1" colspan="1">0.054</td>
<td rowspan="1" colspan="1">0.052</td>
<td rowspan="1" colspan="1">0.052</td>
</tr>
<tr>
<td rowspan="1" colspan="1">P</td>
<td rowspan="1" colspan="1">0.001</td>
<td rowspan="1" colspan="1">0.140</td>
<td rowspan="1" colspan="1">0.225</td>
<td rowspan="1" colspan="1">0.186</td>
<td rowspan="1" colspan="1">0.411</td>
<td rowspan="1" colspan="1">0.536</td>
<td rowspan="1" colspan="1">0.589</td>
<td rowspan="1" colspan="1">0.601</td>
<td rowspan="1" colspan="1">0.606</td>
<td rowspan="1" colspan="1">0.599</td>
</tr>
<tr>
<td rowspan="1" colspan="1">A</td>
<td rowspan="1" colspan="1">0.022</td>
<td rowspan="1" colspan="1">0.014</td>
<td rowspan="1" colspan="1">0.009</td>
<td rowspan="1" colspan="1">0.017</td>
<td rowspan="1" colspan="1">0.043</td>
<td rowspan="1" colspan="1">0.081</td>
<td rowspan="1" colspan="1">0.156</td>
<td rowspan="1" colspan="1">0.199</td>
<td rowspan="1" colspan="1">0.217</td>
<td rowspan="1" colspan="1">0.237</td>
</tr>
<tr>
<td rowspan="1" colspan="1">R</td>
<td rowspan="1" colspan="1">0.021</td>
<td rowspan="1" colspan="1">0.021</td>
<td rowspan="1" colspan="1">0.014</td>
<td rowspan="1" colspan="1">0.023</td>
<td rowspan="1" colspan="1">0.069</td>
<td rowspan="1" colspan="1">0.138</td>
<td rowspan="1" colspan="1">0.282</td>
<td rowspan="1" colspan="1">0.364</td>
<td rowspan="1" colspan="1">0.414</td>
<td rowspan="1" colspan="1">0.454</td>
</tr>
<tr>
<td rowspan="1" colspan="1">F1</td>
<td rowspan="1" colspan="1">0.004</td>
<td rowspan="1" colspan="1">0.010</td>
<td rowspan="1" colspan="1">0.017</td>
<td rowspan="1" colspan="1">0.028</td>
<td rowspan="1" colspan="1">0.074</td>
<td rowspan="1" colspan="1">0.136</td>
<td rowspan="1" colspan="1">0.269</td>
<td rowspan="1" colspan="1">0.340</td>
<td rowspan="1" colspan="1">0.373</td>
<td rowspan="1" colspan="1">0.409</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>The results presented in the previous section clearly show the increase in performance of an SVM classifier for categorising biomedical literature when using a Wikipedia-based bag-of-concepts document representation instead of the classical word-based representation. It is worth noting that, as can be seen in
<xref ref-type="fig" rid="fig-8">Fig. 8</xref>
, the highest increases occur when training sequences are short, because, with enough data, the problems of synonymy and polysemy are masked, and surface overlap performs well.</p>
<p>The increase in classifier performance yields important benefits for users—fundamentally medical staff, researchers, and students—since a suitable and correct categorisation facilitates access to the biomedical articles that are really of interest, thus reducing the time needed to find them.</p>
<p>Comparing the proposed approach to other similar approaches in the literature is not an easy task, due to the lack of biomedical literature classification systems that use a general-purpose semantic annotator, the variety of corpora—and subsets of them—employed, the variety of classification algorithms employed, and the different performance measures used. The only work that uses a general-purpose semantic annotator to classify biomedical literature is
<xref rid="ref-15" ref-type="bibr">Huang & Milne (2012)</xref>
, who classify a subset of MEDLINE—Med100, without specifying whether it is single or multi-label—using a KNN algorithm and with a proportion of training documents similar to that of our work (83%). The authors report an F1-score—without specifying whether it is macro or micro—of about 53%. Regarding the use of domain-specific semantic annotators to create representations of documents in biomedical literature classification tasks, we can cite the work of
<xref rid="ref-49" ref-type="bibr">Zhou, Zhang & Hu (2008b)</xref>
, where the authors classify, using a Naïve Bayes algorithm, a subset of the OHSUMED corpus comprising only 7,400 documents from the year 1991 that belong to just one category out of a total of 14—instead of the 23 that comprise the original OHSUMED corpus—obtaining a macro F1-score of 64% while using 33% of the documents of the corpus as the training sequence; and
<xref rid="ref-44" ref-type="bibr">Yetisgen-Yildiz & Pratt (2005)</xref>
, where the authors use an SVM to classify a non-standard subset of the OHSUMED corpus composed of 179,796 titles of biomedical articles belonging to 1,928 MeSH categories—without specifying whether it is single or multi-label—reporting micro F1-score values of 57%.</p>
<p>Finally, the study leaves open lines for future research. The work presented in this paper may be extended by applying the proposed classifier and document representation to the classification of medical histories and patient records, by using the proposed document representation along with other classification strategies and algorithms, by experimenting with other semantic annotators, and by conducting more experiments with other corpora. Another possible future line is the design and development of a software application that allows the visualisation of the documents classified according to the proposal presented in this paper. Users could then interact with the application and perform exploratory searches through the categories in which documents are classified. In addition, this would allow us to receive input from users about the results of the classification.</p>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>This study presents the benefits of using a Wikipedia-based bag-of-concepts document representation with the SVM classification algorithm to classify biomedical literature into a predefined set of categories. The experiments conducted showed that the BoC representation outperforms the classical BoW representation by up to 157% for the single-label problem and up to 100% for the multi-label problem on the OHSUMED corpus. In addition, we created a purpose-built corpus—UVigoMED—comprising MEDLINE biomedical articles from 2014, on which the classifier using the BoC representation outperforms BoW by up to 122% for the single-label problem and up to 155% for the multi-label problem.</p>
<p>In consequence, we conclude that a Wikipedia-based bag-of-concepts document representation is superior to a baseline BoW representation when it comes to classifying biomedical literature. This is especially true when training sequences are short.</p>
</sec>
</body>
<back>
<sec sec-type="additional-information">
<title>Additional Information and Declarations</title>
<fn-group content-type="competing-interests">
<title>Competing Interests</title>
<fn id="conflict-1" fn-type="conflict">
<p>The authors declare there are no competing interests.</p>
</fn>
</fn-group>
<fn-group content-type="author-contributions">
<title>Author Contributions</title>
<fn id="contribution-1" fn-type="con">
<p>
<xref ref-type="contrib" rid="author-1">Marcos Antonio Mouriño García</xref>
and
<xref ref-type="contrib" rid="author-2">Roberto Pérez Rodríguez</xref>
conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.</p>
</fn>
<fn id="contribution-2" fn-type="con">
<p>
<xref ref-type="contrib" rid="author-3">Luis E. Anido Rifón</xref>
contributed reagents/materials/analysis tools, reviewed drafts of the paper.</p>
</fn>
</fn-group>
<fn-group content-type="other">
<title>Data Availability</title>
<fn id="addinfo-1" fn-type="other">
<p>The following information was supplied regarding data availability:</p>
<p>UVigoMED Corpus:
<uri xlink:href="http://itec-sde.net/UVigoMED.zip">http://itec-sde.net/UVigoMED.zip</uri>
.</p>
</fn>
</fn-group>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1">
<label>Aronson (2001)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aronson</surname>
<given-names>AR</given-names>
</name>
</person-group>
<article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
<source>AMIA Annual Symposium Proceedings</source>
<year>2001</year>
<fpage>17</fpage>
<lpage>21</lpage>
</element-citation>
</ref>
<ref id="ref-2">
<label>Blei, Ng & Jordan (2003)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blei</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>AY</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>MI</given-names>
</name>
</person-group>
<article-title>Latent Dirichlet Allocation</article-title>
<source>Journal of Machine Learning Research</source>
<year>2003</year>
<volume>3</volume>
<fpage>993</fpage>
<lpage>1022</lpage>
</element-citation>
</ref>
<ref id="ref-3">
<label>Blizard (1988)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blizard</surname>
<given-names>WD</given-names>
</name>
</person-group>
<article-title>Multiset theory</article-title>
<source>Notre Dame Journal of Formal Logic</source>
<issue>1</issue>
<year>1988</year>
<volume>30</volume>
<fpage>36</fpage>
<lpage>66</lpage>
<pub-id pub-id-type="doi">10.1305/ndjfl/1093634995</pub-id>
</element-citation>
</ref>
<ref id="ref-4">
<label>Bloehdorn & Hotho (2004)</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Bloehdorn</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hotho</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Boosting for text classification with semantic features</article-title>
<source>WebKDD</source>
<volume>Vol. 3932</volume>
<year>2004</year>
<publisher-name>Springer</publisher-name>
<fpage>149</fpage>
<lpage>166</lpage>
<pub-id pub-id-type="doi">10.1007/11899402_10</pub-id>
</element-citation>
</ref>
<ref id="ref-5">
<label>Bodenreider (2004)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bodenreider</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
<source>Nucleic Acids Research</source>
<year>2004</year>
<volume>32</volume>
<fpage>D267</fpage>
<lpage>D270</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh061</pub-id>
<pub-id pub-id-type="pmid">14681409</pub-id>
</element-citation>
</ref>
<ref id="ref-6">
<label>Dai et al. (2008)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Dai</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>NH</given-names>
</name>
<name>
<surname>Xuan</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Musen</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Watson</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Athey</surname>
<given-names>BD</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>An efficient solution for mapping free text to ontology terms</article-title>
<conf-name>AMIA summit on translational bioinformatics, San Francisco CA</conf-name>
<volume>21</volume>
<year>2008</year>
<comment>
<italic>Available at
<uri xlink:href="http://knowledge.amia.org/amia-55142-tbi2008a-1.650887/t-002-1.985042/f-001-1.985043/a-041-1.985157/an-041-1.985158?qr=1">http://knowledge.amia.org/amia-55142-tbi2008a-1.650887/t-002-1.985042/f-001-1.985043/a-041-1.985157/an-041-1.985158?qr=1</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-7">
<label>Deerwester et al. (1990)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deerwester</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Dumais</surname>
<given-names>ST</given-names>
</name>
<name>
<surname>Furnas</surname>
<given-names>GW</given-names>
</name>
<name>
<surname>Landauer</surname>
<given-names>TK</given-names>
</name>
<name>
<surname>Harshman</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Indexing by latent semantic analysis</article-title>
<source>Journal of the American Society for Information Science</source>
<year>1990</year>
<volume>41</volume>
<fpage>391</fpage>
<lpage>407</lpage>
<pub-id pub-id-type="doi">10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9</pub-id>
</element-citation>
</ref>
<ref id="ref-8">
<label>Egozi, Markovitch & Gabrilovich (2011)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Egozi</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Markovitch</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gabrilovich</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Concept-based information retrieval using explicit semantic analysis</article-title>
<source>ACM Transactions on Information Systems</source>
<issue>2</issue>
<year>2011</year>
<volume>29</volume>
<fpage>1</fpage>
<lpage>38</lpage>
<pub-id pub-id-type="doi">10.1145/1961209.1961211</pub-id>
</element-citation>
</ref>
<ref id="ref-9">
<label>Elkin et al. (1988)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Elkin</surname>
<given-names>PL</given-names>
</name>
<name>
<surname>Cimino</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Lowe</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Aronow</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Payne</surname>
<given-names>TH</given-names>
</name>
<name>
<surname>Pincetl</surname>
<given-names>PS</given-names>
</name>
<name>
<surname>Barnett</surname>
<given-names>GO</given-names>
</name>
</person-group>
<article-title>Mapping to MeSH: the art of trapping MeSH equivalence from within narrative text</article-title>
<conf-name>Proceedings of the annual symposium on computer application in medical care</conf-name>
<year>1988</year>
<fpage>185</fpage>
<lpage>190</lpage>
</element-citation>
</ref>
<ref id="ref-10">
<label>Gabrilovich & Markovitch (2007)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Gabrilovich</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Markovitch</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Computing semantic relatedness using wikipedia-based explicit semantic analysis</article-title>
<conf-name>Proceedings of the 20th international joint conference on artificial intelligence</conf-name>
<year>2007</year>
<fpage>1606</fpage>
<lpage>1611</lpage>
</element-citation>
</ref>
<ref id="ref-11">
<label>Gabrilovich & Markovitch (2009)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gabrilovich</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Markovitch</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Wikipedia-based semantic interpretation for natural language processing</article-title>
<source>Journal of Artificial Intelligence Research</source>
<year>2009</year>
<volume>34</volume>
<fpage>443</fpage>
<lpage>498</lpage>
</element-citation>
</ref>
<ref id="ref-12">
<label>Godbole & Sarawagi (2004)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Godbole</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sarawagi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Discriminative methods for multi-labeled classification</article-title>
<source>Advances in knowledge discovery and data</source>
<series>Lecture notes in computer science</series>
<volume>vol. 3056</volume>
<year>2004</year>
<fpage>22</fpage>
<lpage>30</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://www.springerlink.com/index/maa4ag38jd3pwrc0.pdf">http://www.springerlink.com/index/maa4ag38jd3pwrc0.pdf</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-13">
<label>Harris (1968)</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Harris</surname>
<given-names>ZS</given-names>
</name>
</person-group>
<source>Mathematical structures of language</source>
<year>1968</year>
</element-citation>
</ref>
<ref id="ref-14">
<label>Hearst et al. (1998)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hearst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Dumais</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Osuna</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Platt</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Scholkopf</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Support vector machines</article-title>
<source>Intelligent Systems and their Applications, IEEE</source>
<issue>4</issue>
<year>1998</year>
<volume>13</volume>
<fpage>18</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="doi">10.1109/5254.708428</pub-id>
</element-citation>
</ref>
<ref id="ref-15">
<label>Huang & Milne (2012)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Milne</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Learning a concept-based document similarity measure</article-title>
<source>Journal of the American Society for Information Science and Technology</source>
<year>2012</year>
<volume>63</volume>
<fpage>1593</fpage>
<lpage>1608</lpage>
<pub-id pub-id-type="doi">10.1002/asi.22689</pub-id>
</element-citation>
</ref>
<ref id="ref-16">
<label>Joachims (1998)</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Joachims</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Text categorization with support vector machines: learning with many relevant features</article-title>
<source>Machine learning: ECML-98</source>
<volume>Vol. 1398</volume>
<year>1998</year>
<publisher-loc>New York</publisher-loc>
<publisher-name>Springer</publisher-name>
<fpage>137</fpage>
<lpage>142</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://link.springer.com/chapter/10.1007%2FBFb0026683">http://link.springer.com/chapter/10.1007%2FBFb0026683</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-17">
<label>Jonquet, Shah & Musen (2009)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jonquet</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>NH</given-names>
</name>
<name>
<surname>Musen</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>The open biomedical annotator</article-title>
<source>Summit on Translational Bioinformatics</source>
<year>2009</year>
<volume>2009</volume>
<fpage>56</fpage>
<lpage>60</lpage>
<pub-id pub-id-type="pmid">21347171</pub-id>
</element-citation>
</ref>
<ref id="ref-18">
<label>Kang et al. (2012)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Afzal</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Van Mulligen</surname>
<given-names>EM</given-names>
</name>
<name>
<surname>Kors</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>Using an ensemble system to improve concept extraction from clinical records</article-title>
<source>Journal of Biomedical Informatics</source>
<year>2012</year>
<volume>45</volume>
<fpage>423</fpage>
<lpage>428</lpage>
<pub-id pub-id-type="doi">10.1016/j.jbi.2011.12.009</pub-id>
<pub-id pub-id-type="pmid">22239956</pub-id>
</element-citation>
</ref>
<ref id="ref-19">
<label>Kim, Howland & Park (2005)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Howland</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Dimension reduction in text classification with support vector machines</article-title>
<source>Journal of Machine Learning Research</source>
<year>2005</year>
<volume>6</volume>
<fpage>37</fpage>
<lpage>53</lpage>
</element-citation>
</ref>
<ref id="ref-20">
<label>Landauer & Dumais (1997)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Landauer</surname>
<given-names>TK</given-names>
</name>
<name>
<surname>Dumais</surname>
<given-names>ST</given-names>
</name>
</person-group>
<article-title>A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge</article-title>
<source>Psychological Review</source>
<issue>2</issue>
<year>1997</year>
<volume>104</volume>
<fpage>211</fpage>
<lpage>240</lpage>
<pub-id pub-id-type="doi">10.1037/0033-295X.104.2.211</pub-id>
</element-citation>
</ref>
<ref id="ref-21">
<label>Levelt (1993)</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Levelt</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<source>Speaking: from intention to articulation</source>
<series>ACL-MIT press series in natural-language processing</series>
<volume>vol. 1</volume>
<year>1993</year>
<publisher-loc>Cambridge</publisher-loc>
<publisher-name>MIT Press</publisher-name>
<comment>
<italic>Available at
<uri xlink:href="https://mitpress.mit.edu/books/speaking">https://mitpress.mit.edu/books/speaking</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-22">
<label>Lipscomb (2000)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lipscomb</surname>
<given-names>CE</given-names>
</name>
</person-group>
<article-title>Medical subject headings (MeSH)</article-title>
<source>Bulletin of the Medical Library Association</source>
<issue>3</issue>
<year>2000</year>
<volume>88</volume>
<fpage>265</fpage>
<lpage>266</lpage>
<pub-id pub-id-type="pmid">10928714</pub-id>
</element-citation>
</ref>
<ref id="ref-23">
<label>Lowe & Barnett (1994)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lowe</surname>
<given-names>HJ</given-names>
</name>
<name>
<surname>Barnett</surname>
<given-names>GO</given-names>
</name>
</person-group>
<article-title>Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches</article-title>
<source>Journal of the American Medical Association</source>
<year>1994</year>
<volume>271</volume>
<fpage>1103</fpage>
<lpage>1108</lpage>
<pub-id pub-id-type="doi">10.1001/jama.1994.03510380059038</pub-id>
<pub-id pub-id-type="pmid">8151853</pub-id>
</element-citation>
</ref>
<ref id="ref-24">
<label>Medelyan, Witten & Milne (2008)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Medelyan</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
<name>
<surname>Milne</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Topic indexing with Wikipedia</article-title>
<conf-name>Proceedings of the AAAI WikiAI workshop</conf-name>
<year>2008</year>
<fpage>19</fpage>
<lpage>24</lpage>
</element-citation>
</ref>
<ref id="ref-25">
<label>Milne & Witten (2013)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Milne</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
</person-group>
<article-title>An open-source toolkit for mining Wikipedia</article-title>
<source>Artificial Intelligence</source>
<year>2013</year>
<volume>194</volume>
<fpage>222</fpage>
<lpage>239</lpage>
<pub-id pub-id-type="doi">10.1016/j.artint.2012.06.007</pub-id>
</element-citation>
</ref>
<ref id="ref-26">
<label>Pedregosa et al. (2012)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pedregosa</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Varoquaux</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gramfort</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Michel</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Thirion</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Grisel</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Blondel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Prettenhofer</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dubourg</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Vanderplas</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Passos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cournapeau</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Brucher</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Perrot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Duchesnay</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Scikit-learn: machine learning in Python</article-title>
<source>Journal of Machine Learning Research</source>
<year>2012</year>
<volume>12</volume>
<fpage>2825</fpage>
<lpage>2830</lpage>
</element-citation>
</ref>
<ref id="ref-27">
<label>Phan, Nguyen & Horiguchi (2008)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Phan</surname>
<given-names>X-H</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>L-M</given-names>
</name>
<name>
<surname>Horiguchi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Learning to classify short and sparse text & web with hidden topics from large-scale data collections</article-title>
<conf-name>Proceedings of the 17th international conference on World Wide Web - WWW ’08</conf-name>
<year>2008</year>
<fpage>91</fpage>
<lpage>100</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://portal.acm.org/citation.cfm?doid=1367497.1367510">http://portal.acm.org/citation.cfm?doid=1367497.1367510</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-28">
<label>Porter (1980)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Porter</surname>
<given-names>MF</given-names>
</name>
</person-group>
<article-title>An algorithm for suffix stripping</article-title>
<source>Program</source>
<issue>3</issue>
<year>1980</year>
<volume>14</volume>
<fpage>130</fpage>
<lpage>137</lpage>
<pub-id pub-id-type="doi">10.1108/eb046814</pub-id>
</element-citation>
</ref>
<ref id="ref-29">
<label>Rigutini, Maggini & Liu (2005)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Rigutini</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Maggini</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>An EM based training algorithm for cross-language text categorization</article-title>
<conf-name>Proceedings—2005 IEEE/WIC/ACM international conference on web intelligence, WI 2005</conf-name>
<volume>2005</volume>
<year>2005</year>
<fpage>529</fpage>
<lpage>535</lpage>
</element-citation>
</ref>
<ref id="ref-30">
<label>Sahlgren (2008)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sahlgren</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>The distributional hypothesis</article-title>
<source>Italian Journal of Linguistics</source>
<issue>1</issue>
<year>2008</year>
<volume>20</volume>
<fpage>33</fpage>
<lpage>54</lpage>
</element-citation>
</ref>
<ref id="ref-31">
<label>Sahlgren & Cöster (2004)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sahlgren</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Cöster</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Using bag-of-concepts to improve the performance of support vector machines in text categorization</article-title>
<conf-name>Proceedings of the 20th international conference on computational linguistics</conf-name>
<year>2004</year>
<comment>
<italic>Available at
<uri xlink:href="http://dl.acm.org/citation.cfm?id=1220425">http://dl.acm.org/citation.cfm?id=1220425</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-32">
<label>Salton, Wong & Yang (1975)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salton</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>CS</given-names>
</name>
</person-group>
<article-title>A vector space model for automatic indexing</article-title>
<source>Communications of the ACM</source>
<issue>11</issue>
<year>1975</year>
<volume>18</volume>
<fpage>613</fpage>
<lpage>620</lpage>
<pub-id pub-id-type="doi">10.1145/361219.361220</pub-id>
</element-citation>
</ref>
<ref id="ref-33">
<label>Schapire & Singer (2000)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schapire</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Singer</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>BoosTexter: a boosting-based system for text categorization</article-title>
<source>Machine Learning</source>
<year>2000</year>
<volume>39</volume>
<fpage>135</fpage>
<lpage>168</lpage>
<pub-id pub-id-type="doi">10.1023/A:1007649029923</pub-id>
</element-citation>
</ref>
<ref id="ref-34">
<label>Sebastiani (2002)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sebastiani</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Machine learning in automated text categorization</article-title>
<source>ACM Computing Surveys</source>
<issue>1</issue>
<year>2002</year>
<volume>34</volume>
<fpage>1</fpage>
<lpage>47</lpage>
<pub-id pub-id-type="doi">10.1145/505282.505283</pub-id>
</element-citation>
</ref>
<ref id="ref-35">
<label>Settles (2010)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Settles</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Active learning literature survey</article-title>
<source>Machine Learning</source>
<issue>2</issue>
<year>2010</year>
<volume>15</volume>
<fpage>201</fpage>
<lpage>221</lpage>
</element-citation>
</ref>
<ref id="ref-36">
<label>Stock (2010)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stock</surname>
<given-names>WG</given-names>
</name>
</person-group>
<article-title>Concepts and semantic relations in information science</article-title>
<source>Journal of the American Society for Information Science and Technology</source>
<issue>10</issue>
<year>2010</year>
<volume>61</volume>
<fpage>1951</fpage>
<lpage>1969</lpage>
<pub-id pub-id-type="doi">10.1002/asi.21382</pub-id>
</element-citation>
</ref>
<ref id="ref-37">
<label>Täckström (2005)</label>
<element-citation publication-type="thesis">
<person-group person-group-type="author">
<name>
<surname>Täckström</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>An evaluation of bag-of-concepts representations in automatic text classification</article-title>
<source>Doctoral dissertation, KTH</source>
<year>2005</year>
<fpage>1</fpage>
<lpage>72</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2005/rapporter05/tackstrom_oscar_05150.pdf">http://www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2005/rapporter05/tackstrom_oscar_05150.pdf</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-38">
<label>Tsao, Chen & Wang (2013)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Tsao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>KY</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>HM</given-names>
</name>
</person-group>
<article-title>Semantic naïve Bayes classifier for document classification</article-title>
<conf-name>International joint conference on natural language processing</conf-name>
<year>2013</year>
<fpage>1117</fpage>
<lpage>1123</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://www.aclweb.org/anthology/I/I13/I13-1158.pdf">http://www.aclweb.org/anthology/I/I13/I13-1158.pdf</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-39">
<label>Tsoumakas & Katakis (2007)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tsoumakas</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Katakis</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Multi-label classification: an overview</article-title>
<source>International Journal of Data Warehousing and Mining</source>
<year>2007</year>
<volume>3</volume>
<fpage>1</fpage>
<lpage>13</lpage>
<pub-id pub-id-type="doi">10.4018/jdwm.2007070101</pub-id>
</element-citation>
</ref>
<ref id="ref-40">
<label>Vivaldi & Rodríguez (2010)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vivaldi</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rodríguez</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Using Wikipedia for term extraction in the biomedical domain: first experiences</article-title>
<source>Procesamiento del Lenguaje Natural</source>
<year>2010</year>
<volume>45</volume>
<fpage>251</fpage>
<lpage>254</lpage>
</element-citation>
</ref>
<ref id="ref-41">
<label>Wang et al. (2008)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>H-J</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Using Wikipedia knowledge to improve text classification</article-title>
<source>Knowledge and Information Systems</source>
<issue>3</issue>
<year>2008</year>
<volume>19</volume>
<fpage>265</fpage>
<lpage>281</lpage>
<pub-id pub-id-type="doi">10.1007/s10115-008-0152-4</pub-id>
</element-citation>
</ref>
<ref id="ref-42">
<label>Wang et al. (2007)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>H-J</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Improving text classification by using encyclopedia knowledge</article-title>
<conf-name>Seventh IEEE international conference on data mining (ICDM 2007)</conf-name>
<year>2007</year>
<fpage>332</fpage>
<lpage>341</lpage>
</element-citation>
</ref>
<ref id="ref-43">
<label>Yang (1999)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>An evaluation of statistical approaches to text categorization</article-title>
<source>Information Retrieval</source>
<issue>1</issue>
<year>1999</year>
<volume>1</volume>
<fpage>69</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1023/A:1009982220290</pub-id>
</element-citation>
</ref>
<ref id="ref-44">
<label>Yetisgen-Yildiz & Pratt (2005)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yetisgen-Yildiz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pratt</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>The effect of feature representation on MEDLINE document classification</article-title>
<source>AMIA Annual Symposium Proceedings</source>
<year>2005</year>
<fpage>849</fpage>
<lpage>853</lpage>
<pub-id pub-id-type="pmid">16779160</pub-id>
</element-citation>
</ref>
<ref id="ref-45">
<label>Zhang, Phan & Horiguchi (2008)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Phan</surname>
<given-names>X-H</given-names>
</name>
<name>
<surname>Horiguchi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>An efficient feature selection using hidden topic in text categorization</article-title>
<conf-name>Advanced information networking and applications-workshops</conf-name>
<year>2008</year>
</element-citation>
</ref>
<ref id="ref-46">
<label>Zheng, McLean & Lu (2006)</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname>
<given-names>B</given-names>
</name>
<name>
<surname>McLean</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Identifying biological concepts from a protein-related corpus with a probabilistic topic model</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>58</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-7-58</pub-id>
<pub-id pub-id-type="pmid">16466569</pub-id>
</element-citation>
</ref>
<ref id="ref-47">
<label>Zhou, Zhang & Hu (2006)</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>MaxMatcher: biological concept extraction using approximate dictionary lookup</article-title>
<source>PRICAI 2006: trends in artificial intelligence</source>
<series>Lecture notes in computer science</series>
<volume>vol. 4099</volume>
<year>2006</year>
<publisher-name>Springer</publisher-name>
<fpage>1145</fpage>
<lpage>1149</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://link.springer.com/chapter/10.1007%2F978-3-540-36668-3_150#page-1">http://link.springer.com/chapter/10.1007%2F978-3-540-36668-3_150#page-1</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-48">
<label>Zhou, Zhang & Hu (2008a)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Semantic smoothing for Bayesian text classification with small training data</article-title>
<conf-name>Proceedings of the international conference on data mining</conf-name>
<year>2008a</year>
<fpage>289</fpage>
<lpage>300</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://epubs.siam.org/doi/abs/10.1137/1.9781611972788.2">http://epubs.siam.org/doi/abs/10.1137/1.9781611972788.2</uri>
</italic>
</comment>
</element-citation>
</ref>
<ref id="ref-49">
<label>Zhou, Zhang & Hu (2008b)</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Semantic smoothing for Bayesian text classification with small training data</article-title>
<conf-name>Proceedings of the SIAM international conference on data mining, SDM 2008</conf-name>
<year>2008b</year>
<fpage>289</fpage>
<lpage>300</lpage>
<comment>
<italic>Available at
<uri xlink:href="http://epubs.siam.org/doi/abs/10.1137/1.9781611972788.26">http://epubs.siam.org/doi/abs/10.1137/1.9781611972788.26</uri>
</italic>
</comment>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/TelematiV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000039 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000039 | SxmlIndent | more
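
The two commands above only display the record in a pager. As a minimal sketch building on them, the indented XML can be saved to a file and the cited article titles extracted with standard Unix tools (the file name 000039.xml and the sed expression are illustrative, not part of Dilib; the pattern assumes each <article-title> element sits on its own line, as in the record above):

# Save the indented record, then print every <article-title> it contains
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000039 | SxmlIndent > 000039.xml
sed -n 's:.*<article-title>\(.*\)</article-title>.*:\1:p' 000039.xml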

To place a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    TelematiV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4592155
   |texte=   Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26468436" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a TelematiV1 
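
The same pipeline can write the generated page to a file instead of the terminal. A lightly commented variant follows (page.wiki is an arbitrary output name; the step descriptions are inferred from the tool and file names, not from Dilib documentation):

# 1. Look the record up by its PubMed identifier in the RBID index,
# 2. fetch the full record from biblio.hfd,
# 3. convert it to Wicri wiki markup and save the result.
HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:26468436" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a TelematiV1 > page.wiki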

Wicri

This area was generated with Dilib version V0.6.31.
Data generation: Thu Nov 2 16:09:04 2017. Site generation: Sun Mar 10 16:42:28 2024