Exploration server on open access in Belgium

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000098 (Pmc/Corpus); previous: 0000979; next: 0000990

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">
<italic>Beegle:</italic>
from literature mining to disease-gene discovery</title>
<author>
<name sortKey="Elshal, Sarah" sort="Elshal, Sarah" uniqKey="Elshal S" first="Sarah" last="Elshal">Sarah Elshal</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tranchevent, Leon Charles" sort="Tranchevent, Leon Charles" uniqKey="Tranchevent L" first="Léon-Charles" last="Tranchevent">Léon-Charles Tranchevent</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF3">Inserm UMR-S1052, CNRS UMR5286, Cancer Research Centre of Lyon, Lyon, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF4">Université de Lyon 1, Villeurbanne, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF5">Centre Léon Bérard, Lyon, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sifrim, Alejandro" sort="Sifrim, Alejandro" uniqKey="Sifrim A" first="Alejandro" last="Sifrim">Alejandro Sifrim</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF6">Wellcome Trust Genome Campus, Hinxton, Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ardeshirdavani, Amin" sort="Ardeshirdavani, Amin" uniqKey="Ardeshirdavani A" first="Amin" last="Ardeshirdavani">Amin Ardeshirdavani</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Davis, Jesse" sort="Davis, Jesse" uniqKey="Davis J" first="Jesse" last="Davis">Jesse Davis</name>
<affiliation>
<nlm:aff id="AFF7">Department of Computer Science (DTAI), KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Moreau, Yves" sort="Moreau, Yves" uniqKey="Moreau Y" first="Yves" last="Moreau">Yves Moreau</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26384564</idno>
<idno type="pmc">4737179</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4737179</idno>
<idno type="RBID">PMC:4737179</idno>
<idno type="doi">10.1093/nar/gkv905</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000098</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">
<italic>Beegle:</italic>
from literature mining to disease-gene discovery</title>
<author>
<name sortKey="Elshal, Sarah" sort="Elshal, Sarah" uniqKey="Elshal S" first="Sarah" last="Elshal">Sarah Elshal</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tranchevent, Leon Charles" sort="Tranchevent, Leon Charles" uniqKey="Tranchevent L" first="Léon-Charles" last="Tranchevent">Léon-Charles Tranchevent</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF3">Inserm UMR-S1052, CNRS UMR5286, Cancer Research Centre of Lyon, Lyon, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF4">Université de Lyon 1, Villeurbanne, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF5">Centre Léon Bérard, Lyon, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sifrim, Alejandro" sort="Sifrim, Alejandro" uniqKey="Sifrim A" first="Alejandro" last="Sifrim">Alejandro Sifrim</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF6">Wellcome Trust Genome Campus, Hinxton, Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ardeshirdavani, Amin" sort="Ardeshirdavani, Amin" uniqKey="Ardeshirdavani A" first="Amin" last="Ardeshirdavani">Amin Ardeshirdavani</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Davis, Jesse" sort="Davis, Jesse" uniqKey="Davis J" first="Jesse" last="Davis">Jesse Davis</name>
<affiliation>
<nlm:aff id="AFF7">Department of Computer Science (DTAI), KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Moreau, Yves" sort="Moreau, Yves" uniqKey="Moreau Y" first="Yves" last="Moreau">Yves Moreau</name>
<affiliation>
<nlm:aff id="AFF1">Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="AFF2">iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Nucleic Acids Research</title>
<idno type="ISSN">0305-1048</idno>
<idno type="eISSN">1362-4962</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Disease-gene identification is a challenging process that has multiple applications within functional genomics and personalized medicine. Typically, this process involves both finding genes known to be associated with the disease (through literature search) and carrying out preliminary experiments or screens (e.g. linkage or association studies, copy number analyses, expression profiling) to determine a set of promising candidates for experimental validation. This requires extensive time and monetary resources. We describe
<italic>Beegle</italic>
, an online search and discovery engine that attempts to simplify this process by automating the typical approaches. It starts by mining the literature to quickly extract a set of genes known to be linked with a given query, then it integrates the learning methodology of
<italic>Endeavour</italic>
(a gene prioritization tool) to train a genomic model and rank a set of candidate genes to generate novel hypotheses. In a realistic evaluation setup,
<italic>Beegle</italic>
has an average recall of 84% in the top 100 returned genes as a search engine, which improves the discovery engine by 12.6% in the top 5% prioritized genes.
<italic>Beegle</italic>
is publicly available at
<ext-link ext-link-type="uri" xlink:href="http://beegle.esat.kuleuven.be/">http://beegle.esat.kuleuven.be/</ext-link>
.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Perez Iratxeta, C" uniqKey="Perez Iratxeta C">C. Perez-Iratxeta</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P. Bork</name>
</author>
<author>
<name sortKey="Andrade, M A" uniqKey="Andrade M">M.A. Andrade</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Moody, S E" uniqKey="Moody S">S.E. Moody</name>
</author>
<author>
<name sortKey="Boehm, J S" uniqKey="Boehm J">J.S. Boehm</name>
</author>
<author>
<name sortKey="Barbie, D A" uniqKey="Barbie D">D.A. Barbie</name>
</author>
<author>
<name sortKey="Hahn, W C" uniqKey="Hahn W">W.C. Hahn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="He, M" uniqKey="He M">M. He</name>
</author>
<author>
<name sortKey="Xu, M" uniqKey="Xu M">M. Xu</name>
</author>
<author>
<name sortKey="Zhang, B" uniqKey="Zhang B">B. Zhang</name>
</author>
<author>
<name sortKey="Liang, J" uniqKey="Liang J">J. Liang</name>
</author>
<author>
<name sortKey="Chen, P" uniqKey="Chen P">P. Chen</name>
</author>
<author>
<name sortKey="Lee, J Y" uniqKey="Lee J">J.Y. Lee</name>
</author>
<author>
<name sortKey="Johnson, T A" uniqKey="Johnson T">T.A. Johnson</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
<author>
<name sortKey="Yang, X" uniqKey="Yang X">X. Yang</name>
</author>
<author>
<name sortKey="Dai, J" uniqKey="Dai J">J. Dai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qiu, Y H" uniqKey="Qiu Y">Y.H. Qiu</name>
</author>
<author>
<name sortKey="Deng, F Y" uniqKey="Deng F">F.Y. Deng</name>
</author>
<author>
<name sortKey="Li, M J" uniqKey="Li M">M.J. Li</name>
</author>
<author>
<name sortKey="Lei, S F" uniqKey="Lei S">S.F. Lei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Penttil, S" uniqKey="Penttil S">S. Penttilä</name>
</author>
<author>
<name sortKey="Jokela, M" uniqKey="Jokela M">M. Jokela</name>
</author>
<author>
<name sortKey="Bouquin, H" uniqKey="Bouquin H">H. Bouquin</name>
</author>
<author>
<name sortKey="Saukkonen, A M" uniqKey="Saukkonen A">A.M. Saukkonen</name>
</author>
<author>
<name sortKey="Toivanen, J" uniqKey="Toivanen J">J. Toivanen</name>
</author>
<author>
<name sortKey="Udd, B" uniqKey="Udd B">B. Udd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tranchevent, L C" uniqKey="Tranchevent L">L.C. Tranchevent</name>
</author>
<author>
<name sortKey="Capdevila, F B" uniqKey="Capdevila F">F.B. Capdevila</name>
</author>
<author>
<name sortKey="Nitsch, D" uniqKey="Nitsch D">D. Nitsch</name>
</author>
<author>
<name sortKey="De Moor, B" uniqKey="De Moor B">B. De Moor</name>
</author>
<author>
<name sortKey="De Causmaecker, P" uniqKey="De Causmaecker P">P. De Causmaecker</name>
</author>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y. Moreau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Doncheva, N T" uniqKey="Doncheva N">N.T. Doncheva</name>
</author>
<author>
<name sortKey="Kacprowski, T" uniqKey="Kacprowski T">T. Kacprowski</name>
</author>
<author>
<name sortKey="Albrecht, M" uniqKey="Albrecht M">M. Albrecht</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Piro, R M" uniqKey="Piro R">R.M. Piro</name>
</author>
<author>
<name sortKey="Di Cunto, F" uniqKey="Di Cunto F">F. Di Cunto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y. Moreau</name>
</author>
<author>
<name sortKey="Tranchevent, L C" uniqKey="Tranchevent L">L.C. Tranchevent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bornigen, D" uniqKey="Bornigen D">D. Börnigen</name>
</author>
<author>
<name sortKey="Tranchevent, L C" uniqKey="Tranchevent L">L.C. Tranchevent</name>
</author>
<author>
<name sortKey="Bonachela Capdevil, F" uniqKey="Bonachela Capdevil F">F. Bonachela-Capdevil</name>
</author>
<author>
<name sortKey="Devriendt, K" uniqKey="Devriendt K">K. Devriendt</name>
</author>
<author>
<name sortKey="De Moor, B" uniqKey="De Moor B">B. De Moor</name>
</author>
<author>
<name sortKey="De Causmaecker, P" uniqKey="De Causmaecker P">P. De Causmaecker</name>
</author>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y. Moreau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tranchevent, L C" uniqKey="Tranchevent L">L.C. Tranchevent</name>
</author>
<author>
<name sortKey="Barriot, R" uniqKey="Barriot R">R. Barriot</name>
</author>
<author>
<name sortKey="Yu, S" uniqKey="Yu S">S. Yu</name>
</author>
<author>
<name sortKey="Van Vooren, S" uniqKey="Van Vooren S">S. Van Vooren</name>
</author>
<author>
<name sortKey="Van Loo, P" uniqKey="Van Loo P">P. Van Loo</name>
</author>
<author>
<name sortKey="Coessens, B" uniqKey="Coessens B">B. Coessens</name>
</author>
<author>
<name sortKey="De Moor, B" uniqKey="De Moor B">B. De Moor</name>
</author>
<author>
<name sortKey="Aerts, S" uniqKey="Aerts S">S. Aerts</name>
</author>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y. Moreau</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Seelow, D" uniqKey="Seelow D">D. Seelow</name>
</author>
<author>
<name sortKey="Schwarz, J M" uniqKey="Schwarz J">J.M. Schwarz</name>
</author>
<author>
<name sortKey="Schuelke, M" uniqKey="Schuelke M">M. Schuelke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J. Chen</name>
</author>
<author>
<name sortKey="Bardes, E E" uniqKey="Bardes E">E.E. Bardes</name>
</author>
<author>
<name sortKey="Aronow, B J" uniqKey="Aronow B">B.J. Aronow</name>
</author>
<author>
<name sortKey="Jegga, A G" uniqKey="Jegga A">A.G. Jegga</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Amberger, J" uniqKey="Amberger J">J. Amberger</name>
</author>
<author>
<name sortKey="Bocchini, C" uniqKey="Bocchini C">C. Bocchini</name>
</author>
<author>
<name sortKey="Hamosh, A" uniqKey="Hamosh A">A. Hamosh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Becker, K G" uniqKey="Becker K">K.G. Becker</name>
</author>
<author>
<name sortKey="Barnes, K C" uniqKey="Barnes K">K.C. Barnes</name>
</author>
<author>
<name sortKey="Bright, T J" uniqKey="Bright T">T.J. Bright</name>
</author>
<author>
<name sortKey="Wang, S A" uniqKey="Wang S">S.A. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jensen, J L" uniqKey="Jensen J">J.L. Jensen</name>
</author>
<author>
<name sortKey="Saric, J" uniqKey="Saric J">J. Saric</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P. Bork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jelier, R" uniqKey="Jelier R">R. Jelier</name>
</author>
<author>
<name sortKey="Jenster, G" uniqKey="Jenster G">G. Jenster</name>
</author>
<author>
<name sortKey="Dorssers, L C" uniqKey="Dorssers L">L.C. Dorssers</name>
</author>
<author>
<name sortKey="Wouters, B J" uniqKey="Wouters B">B.J. Wouters</name>
</author>
<author>
<name sortKey="Hendriksen, P J" uniqKey="Hendriksen P">P.J. Hendriksen</name>
</author>
<author>
<name sortKey="Mons, B" uniqKey="Mons B">B. Mons</name>
</author>
<author>
<name sortKey="Delwel, R" uniqKey="Delwel R">R. Delwel</name>
</author>
<author>
<name sortKey="Kors, J A" uniqKey="Kors J">J.A. Kors</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jelier, R" uniqKey="Jelier R">R. Jelier</name>
</author>
<author>
<name sortKey="Schuemie, M J" uniqKey="Schuemie M">M.J. Schuemie</name>
</author>
<author>
<name sortKey="Roes, P J" uniqKey="Roes P">P.J. Roes</name>
</author>
<author>
<name sortKey="Van Mulligen, E M" uniqKey="Van Mulligen E">E.M. van Mulligen</name>
</author>
<author>
<name sortKey="Kors, J A" uniqKey="Kors J">J.A. Kors</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jelier, R" uniqKey="Jelier R">R. Jelier</name>
</author>
<author>
<name sortKey="Schuemie, M J" uniqKey="Schuemie M">M.J. Schuemie</name>
</author>
<author>
<name sortKey="Veldhoven, A" uniqKey="Veldhoven A">A. Veldhoven</name>
</author>
<author>
<name sortKey="Dorssers, L C" uniqKey="Dorssers L">L.C. Dorssers</name>
</author>
<author>
<name sortKey="Jenster, G" uniqKey="Jenster G">G. Jenster</name>
</author>
<author>
<name sortKey="Kors, J A" uniqKey="Kors J">J.A. Kors</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Glenisson, P" uniqKey="Glenisson P">P. Glenisson</name>
</author>
<author>
<name sortKey="Coessens, B" uniqKey="Coessens B">B. Coessens</name>
</author>
<author>
<name sortKey="Van Vooren, S" uniqKey="Van Vooren S">S. Van Vooren</name>
</author>
<author>
<name sortKey="Mathys, J" uniqKey="Mathys J">J. Mathys</name>
</author>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y. Moreau</name>
</author>
<author>
<name sortKey="De Moor, B" uniqKey="De Moor B">B. De Moor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fleuren, W W" uniqKey="Fleuren W">W.W. Fleuren</name>
</author>
<author>
<name sortKey="Verhoeven, S" uniqKey="Verhoeven S">S. Verhoeven</name>
</author>
<author>
<name sortKey="Frijters, R" uniqKey="Frijters R">R. Frijters</name>
</author>
<author>
<name sortKey="Heupers, B" uniqKey="Heupers B">B. Heupers</name>
</author>
<author>
<name sortKey="Polman, J" uniqKey="Polman J">J. Polman</name>
</author>
<author>
<name sortKey="Van Schaik, R" uniqKey="Van Schaik R">R. van Schaik</name>
</author>
<author>
<name sortKey="De Vlieg, J" uniqKey="De Vlieg J">J. de Vlieg</name>
</author>
<author>
<name sortKey="Alkema, W" uniqKey="Alkema W">W. Alkema</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cheung, W A" uniqKey="Cheung W">W.A. Cheung</name>
</author>
<author>
<name sortKey="Ouellette, B F" uniqKey="Ouellette B">B.F. Ouellette</name>
</author>
<author>
<name sortKey="Wasserman, W W" uniqKey="Wasserman W">W.W. Wasserman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hristovski, D" uniqKey="Hristovski D">D. Hristovski</name>
</author>
<author>
<name sortKey="Friedman, C" uniqKey="Friedman C">C. Friedman</name>
</author>
<author>
<name sortKey="Rindflesch, T C" uniqKey="Rindflesch T">T.C. Rindflesch</name>
</author>
<author>
<name sortKey="Peterlin, B" uniqKey="Peterlin B">B. Peterlin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fontaine, J F" uniqKey="Fontaine J">J.F. Fontaine</name>
</author>
<author>
<name sortKey="Priller, F" uniqKey="Priller F">F. Priller</name>
</author>
<author>
<name sortKey="Barbosa Silva, A" uniqKey="Barbosa Silva A">A. Barbosa-Silva</name>
</author>
<author>
<name sortKey="Andrade Navarro, M A" uniqKey="Andrade Navarro M">M.A. Andrade-Navarro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Aronson, A R" uniqKey="Aronson A">A.R. Aronson</name>
</author>
<author>
<name sortKey="Lang, F M" uniqKey="Lang F">F.M. Lang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bodenreider, O" uniqKey="Bodenreider O">O. Bodenreider</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manning, C D" uniqKey="Manning C">C.D. Manning</name>
</author>
<author>
<name sortKey="Raghavan, P" uniqKey="Raghavan P">P. Raghavan</name>
</author>
<author>
<name sortKey="Schutze, H" uniqKey="Schutze H">H. Schutze</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Aerts, S" uniqKey="Aerts S">S. Aerts</name>
</author>
<author>
<name sortKey="Lambrechts, D" uniqKey="Lambrechts D">D. Lambrechts</name>
</author>
<author>
<name sortKey="Maity, S" uniqKey="Maity S">S. Maity</name>
</author>
<author>
<name sortKey="Van Loo, P" uniqKey="Van Loo P">P. Van Loo</name>
</author>
<author>
<name sortKey="Coessens, B" uniqKey="Coessens B">B. Coessens</name>
</author>
<author>
<name sortKey="De Smet, F" uniqKey="De Smet F">F. De Smet</name>
</author>
<author>
<name sortKey="Tranchevent, L C" uniqKey="Tranchevent L">L.C. Tranchevent</name>
</author>
<author>
<name sortKey="De Moor, B" uniqKey="De Moor B">B. De Moor</name>
</author>
<author>
<name sortKey="Marynen, P" uniqKey="Marynen P">P. Marynen</name>
</author>
<author>
<name sortKey="Hassan, B" uniqKey="Hassan B">B. Hassan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lopez Bigas, N" uniqKey="Lopez Bigas N">N. Lopez-Bigas</name>
</author>
<author>
<name sortKey="Ouzounis, C A" uniqKey="Ouzounis C">C.A. Ouzounis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lukk, M" uniqKey="Lukk M">M. Lukk</name>
</author>
<author>
<name sortKey="Kapushesky, M" uniqKey="Kapushesky M">M. Kapushesky</name>
</author>
<author>
<name sortKey="Nikkil, J" uniqKey="Nikkil J">J. Nikkilä</name>
</author>
<author>
<name sortKey="Parkinson, H" uniqKey="Parkinson H">H. Parkinson</name>
</author>
<author>
<name sortKey="Goncalves, A" uniqKey="Goncalves A">A. Goncalves</name>
</author>
<author>
<name sortKey="Huber, W" uniqKey="Huber W">W. Huber</name>
</author>
<author>
<name sortKey="Ukkonen, E" uniqKey="Ukkonen E">E. Ukkonen</name>
</author>
<author>
<name sortKey="Brazma, A" uniqKey="Brazma A">A. Brazma</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sifrim, A" uniqKey="Sifrim A">A. Sifrim</name>
</author>
<author>
<name sortKey="Popovic, D" uniqKey="Popovic D">D. Popovic</name>
</author>
<author>
<name sortKey="Tranchevent, L C" uniqKey="Tranchevent L">L.C. Tranchevent</name>
</author>
<author>
<name sortKey="Ardeshirdavani, A" uniqKey="Ardeshirdavani A">A. Ardeshirdavani</name>
</author>
<author>
<name sortKey="Sakai, R" uniqKey="Sakai R">R. Sakai</name>
</author>
<author>
<name sortKey="Konings, P" uniqKey="Konings P">P. Konings</name>
</author>
<author>
<name sortKey="Vermeesch, J R" uniqKey="Vermeesch J">J.R. Vermeesch</name>
</author>
<author>
<name sortKey="Aerts, J" uniqKey="Aerts J">J. Aerts</name>
</author>
<author>
<name sortKey="De Moor, B" uniqKey="De Moor B">B. De Moor</name>
</author>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y. Moreau</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="iso-abbrev">Nucleic Acids Res</journal-id>
<journal-id journal-id-type="hwp">nar</journal-id>
<journal-id journal-id-type="publisher-id">nar</journal-id>
<journal-title-group>
<journal-title>Nucleic Acids Research</journal-title>
</journal-title-group>
<issn pub-type="ppub">0305-1048</issn>
<issn pub-type="epub">1362-4962</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26384564</article-id>
<article-id pub-id-type="pmc">4737179</article-id>
<article-id pub-id-type="doi">10.1093/nar/gkv905</article-id>
<article-categories>
<subj-group subj-group-type="hwp-journal-coll">
<subject>7</subject>
<subject>24</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Methods Online</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>
<italic>Beegle:</italic>
from literature mining to disease-gene discovery</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>ElShal</surname>
<given-names>Sarah</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tranchevent</surname>
<given-names>Léon-Charles</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="AFF3">
<sup>3</sup>
</xref>
<xref ref-type="aff" rid="AFF4">
<sup>4</sup>
</xref>
<xref ref-type="aff" rid="AFF5">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sifrim</surname>
<given-names>Alejandro</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF2">
<sup>2</sup>
</xref>
<xref ref-type="aff" rid="AFF6">
<sup>6</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ardeshirdavani</surname>
<given-names>Amin</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Davis</surname>
<given-names>Jesse</given-names>
</name>
<xref ref-type="aff" rid="AFF7">
<sup>7</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Moreau</surname>
<given-names>Yves</given-names>
</name>
<xref ref-type="aff" rid="AFF1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="AFF2">
<sup>2</sup>
</xref>
</contrib>
<aff id="AFF1">
<label>1</label>
Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium</aff>
<aff id="AFF2">
<label>2</label>
iMinds Future Health Department, KU Leuven, Leuven 3001, Belgium</aff>
<aff id="AFF3">
<label>3</label>
Inserm UMR-S1052, CNRS UMR5286, Cancer Research Centre of Lyon, Lyon, France</aff>
<aff id="AFF4">
<label>4</label>
Université de Lyon 1, Villeurbanne, France</aff>
<aff id="AFF5">
<label>5</label>
Centre Léon Bérard, Lyon, France</aff>
<aff id="AFF6">
<label>6</label>
Wellcome Trust Genome Campus, Hinxton, Wellcome Trust Sanger Institute, Cambridge CB10 1SA, UK</aff>
<aff id="AFF7">
<label>7</label>
Department of Computer Science (DTAI), KU Leuven, Leuven 3001, Belgium</aff>
</contrib-group>
<author-notes>
<corresp id="COR1">
<label>*</label>
To whom correspondence should be addressed. Tel: +32 16 32 73 86; Fax: +32 16 32 19 70; Email:
<email>sarah.elshal@esat.kuleuven.be</email>
</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>29</day>
<month>1</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="epub">
<day>17</day>
<month>9</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>17</day>
<month>9</month>
<year>2015</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>44</volume>
<issue>2</issue>
<fpage>e18</fpage>
<lpage>e18</lpage>
<history>
<date date-type="accepted">
<day>29</day>
<month>8</month>
<year>2015</year>
</date>
<date date-type="rev-recd">
<day>25</day>
<month>8</month>
<year>2015</year>
</date>
<date date-type="received">
<day>03</day>
<month>3</month>
<year>2015</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.</copyright-statement>
<copyright-year>2016</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>
), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
<email>journals.permissions@oup.com</email>
</license-p>
</license>
</permissions>
<self-uri xlink:title="pdf" xlink:href="gkv905.pdf"></self-uri>
<abstract>
<p>Disease-gene identification is a challenging process that has multiple applications within functional genomics and personalized medicine. Typically, this process involves both finding genes known to be associated with the disease (through literature search) and carrying out preliminary experiments or screens (e.g. linkage or association studies, copy number analyses, expression profiling) to determine a set of promising candidates for experimental validation. This requires extensive time and monetary resources. We describe
<italic>Beegle</italic>
, an online search and discovery engine that attempts to simplify this process by automating the typical approaches. It starts by mining the literature to quickly extract a set of genes known to be linked with a given query, then it integrates the learning methodology of
<italic>Endeavour</italic>
(a gene prioritization tool) to train a genomic model and rank a set of candidate genes to generate novel hypotheses. In a realistic evaluation setup,
<italic>Beegle</italic>
has an average recall of 84% in the top 100 returned genes as a search engine, which improves the discovery engine by 12.6% in the top 5% prioritized genes.
<italic>Beegle</italic>
is publicly available at
<ext-link ext-link-type="uri" xlink:href="http://beegle.esat.kuleuven.be/">http://beegle.esat.kuleuven.be/</ext-link>
.</p>
</abstract>
<counts>
<page-count count="8"></page-count>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>cover-date</meta-name>
<meta-value>29 January 2016</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="SEC1">
<title>INTRODUCTION</title>
<p>Determining which genes cause which diseases is an important yet challenging problem (
<xref rid="B1" ref-type="bibr">1</xref>
). It has a variety of applications that range from DNA screening and early diagnosis, to gene sequence analysis and drug development (
<xref rid="B2" ref-type="bibr">2</xref>
). However, it is resource-intensive in terms of both time and monetary cost. Traditionally, disease-gene identification is approached manually and conducted in two phases. The first phase involves narrowing down a large set of candidate genes (e.g. the whole genome) to a significantly smaller set that has a high probability of containing a disease-causing gene. Different ways exist to tackle this phase, such as linkage analysis, genome sequencing and association studies (
<xref rid="B3" ref-type="bibr">3</xref>
–
<xref rid="B5" ref-type="bibr">5</xref>
). Then, in the second phase, experts experimentally evaluate the selected genes to confirm which of those candidates are truly disease-causing. This involves wet-lab experimentation for every selected gene. Consequently, an important advancement in this field has been the development of computational methods that can help the experts address the first phase of this process by automatically prioritizing a set of candidate genes for final experimental validation to maximize the yield of the second phase.</p>
<p>Many computational methods for human gene prioritization have been developed, and several review articles exist that describe their approaches, their differences, and how they can be used in practice (
<xref rid="B6" ref-type="bibr">6</xref>
–
<xref rid="B9" ref-type="bibr">9</xref>
). These methods differ in their expected inputs, their returned outputs and their prioritization strategies. A previous study compared the performance of eight of these methods that are publicly available as web-based tools (
<xref rid="B10" ref-type="bibr">10</xref>
). The evaluation setup used a realistic scenario where data prior to a certain date were used to generate the gene prioritizations and then the predictions were compared to disease-gene annotations discovered later. The results showed that
<italic>Endeavour</italic>
(
<xref rid="B11" ref-type="bibr">11</xref>
),
<italic>GeneDistiller</italic>
(
<xref rid="B12" ref-type="bibr">12</xref>
) and
<italic>ToppGene</italic>
(
<xref rid="B13" ref-type="bibr">13</xref>
) performed best when measuring the true-positive rates among the top returned genes. All three tools require a set of training genes (genes that are known to be linked to the disease of interest) or keywords (describing the disease under study) as input, which is then used to infer several models (according to different genomic sources) and rank a set of candidate genes based on the learned models. These approaches all require hand-selected input to compute the gene prioritizations. Normally, experts undertake this challenging and time-consuming process by collecting information from (i) generic or disease specific association databases (e.g. OMIM (
<xref rid="B14" ref-type="bibr">14</xref>
), GAD (
<xref rid="B15" ref-type="bibr">15</xref>
)), (ii) relevant literature or (iii) their own data and expertise (including for instance relevant patient records). Therefore, a tool that would support automatic identification of the training genes as an initial step to candidate gene prioritization would provide better usability to researchers interested in candidate gene prioritization (
<xref rid="B16" ref-type="bibr">16</xref>
).</p>
<p>Text mining is one popular strategy for automatically associating biomedical entities with each other (
<xref rid="B17" ref-type="bibr">17</xref>
–
<xref rid="B24" ref-type="bibr">24</xref>
).
<italic>MeSHOP</italic>
(
<xref rid="B22" ref-type="bibr">22</xref>
) and
<italic>Genie</italic>
(
<xref rid="B24" ref-type="bibr">24</xref>
) are two examples that associate genes with diseases. These tools can rank a set of genes given a disease query, so genetic experts could use them to search automatically for the training input required by gene prioritization tools. However, these tools do not distinguish known gene associations from unknown ones. Hence, for a user who wants to select a set of training genes before running a gene prioritization, the existing tools are limited and require post-processing to filter the resulting gene associations.</p>
<p>This article presents
<italic>Beegle</italic>
, an online tool for disease-gene prioritization. First,
<italic>Beegle</italic>
mines the literature to automatically identify a list of genes known to be linked with a given query. Next,
<italic>Beegle</italic>
employs
<italic>Endeavour</italic>
, which integrates multiple genomic models to automatically rank a set of candidate genes (e.g. the human genome) according to a selected set of genes identified in the first step. We evaluated
<italic>Beegle</italic>
in two different ways. First, we evaluated its ability to identify known disease-gene associations from the literature. To do this, we have extracted a list of experimentally validated disease-gene associations from the OMIM database. Then, for each disease in this list, we compared the associations returned by
<italic>Beegle</italic>
to the known associations from OMIM. In addition, we compared
<italic>Beegle</italic>
to
<italic>MeSHOP</italic>
, a similar tool, on a subset of the OMIM list using the same experimental setup. Second, we evaluated the suitability of the returned genes to serve as input to train genomic models and generate novel hypotheses. Here, we employed an evaluation methodology that mimics real discovery by using rolled-back data to generate the gene prioritizations, and then by testing on disease-gene associations that were reported after the training data were collected. For this, we used two benchmarks: a previously described literature-based benchmark (
<xref rid="B10" ref-type="bibr">10</xref>
) and a new one that we have generated from the OMIM database. Our OMIM benchmark is a secondary contribution of this work and is made publicly available as supplementary data so that other researchers can use it to evaluate gene prioritization approaches.</p>
</sec>
<sec sec-type="materials|methods" id="SEC2">
<title>MATERIALS AND METHODS</title>
<sec id="SEC2-1">
<title>The pipeline</title>
<p>An overview of the current methodology of
<italic>Beegle</italic>
is shown in Figure
<xref ref-type="fig" rid="F1">1</xref>
.
<italic>Beegle</italic>
starts from a user query (e.g. a disease) and proceeds in two phases. First, it automatically analyses the literature to identify the genes that are potentially related to the given query. We call this phase the search phase. Second, it provides these genes as a seed set to
<italic>Endeavour</italic>
, which then analyses a number of genomic data sources to finally prioritize a set of candidate genes. We call this phase the discovery phase.</p>
<fig id="F1" orientation="portrait" position="float">
<label>Figure 1.</label>
<caption>
<p>An illustration of
<italic>Beegle's</italic>
pipeline. The disease-gene annotation in
<italic>Beegle</italic>
follows two phases: search and discovery. The search phase involves two text mining techniques, while the discovery phase involves fusing different genomic models.</p>
</caption>
<graphic xlink:href="gkv905fig1"></graphic>
</fig>
</sec>
<sec id="SEC2-2">
<title>Annotating the literature</title>
<p>We use the biomedical database MEDLINE as our source of literature. Offline, we have indexed every MEDLINE abstract using MetaMap (
<xref rid="B25" ref-type="bibr">25</xref>
), which identifies the UMLS concepts (
<xref rid="B26" ref-type="bibr">26</xref>
) within a given abstract text. UMLS is a large, multi-purpose and multi-lingual thesaurus that brings together many health and biomedical vocabularies and standards (e.g. MeSH and SNOMED CT). With this strategy we have associated each MEDLINE abstract with a list of UMLS concepts. This corresponds to 12 308 151 abstract-concept entries. We report the corresponding list of MEDLINE IDs in Supplementary Data 1.</p>
<p>For every gene, we find the list of associated MEDLINE abstracts according to GeneRIF (downloaded in May 2012). Hence, we could generate a UMLS concept profile for all Entrez gene entries (16 493 genes in total, which we report in Supplementary Data 2). The gene profiles are described using 66 883 concepts, which we call the Genes-vocabulary. For every query, we find the list of associated MEDLINE abstracts according to PubMed, where we only consider the top 10 000 PubMed Ids to generate a corresponding UMLS profile. We restrict the query profiles to the concepts that already appear in the Genes-vocabulary.</p>
</sec>
<sec id="SEC2-3">
<title>The search phase</title>
<p>In the search phase,
<italic>Beegle</italic>
applies two text mining approaches to identify the genes most related to a given query. The first one is based on the number of abstracts in which the query and a given gene co-occur. We call this the explicit approach, since it relies on the explicit co-occurrence of a query and a given gene in the literature. We count three values: (i) the number of abstracts associated with the query, (ii) the number of abstracts associated with the gene and (iii) the number of abstracts associated with both the query and the gene. Then we use the Jaccard similarity to measure the strength of the association according to Equation (
<xref ref-type="disp-formula" rid="M1">1</xref>
):
<disp-formula id="M1">
<label>(1)</label>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{equation*} {\rm similarity}\_{\rm explicit}(q,g) = \frac{{X_{q,g} }}{{N_q + K_g - X_{q,g} }} \end{equation*}\end{document}</tex-math>
</disp-formula>
where
<italic>N</italic>
is the number of abstracts associated with the query,
<italic>K</italic>
is the number of abstracts associated with the gene and
<italic>X</italic>
is the number of abstracts associated with both query and gene.</p>
<p>The higher the similarity score, the more confident we are that the association between the gene and the given query is real.</p>
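<p>As an illustration, Equation (1) can be sketched in a few lines of Python. This is a minimal sketch and not the code behind the web server: the function name and the abstract counts are hypothetical, and returning 0 when neither the query nor the gene has any abstract is an assumption made here to avoid a division by zero.</p>
<preformat preformat-type="code"><![CDATA[
```python
def jaccard_explicit(n_query, k_gene, x_both):
    """Jaccard similarity of Equation (1), computed from abstract counts.

    n_query: number of abstracts associated with the query (N_q)
    k_gene:  number of abstracts associated with the gene (K_g)
    x_both:  number of abstracts associated with both (X_{q,g})
    """
    union = n_query + k_gene - x_both
    if union == 0:
        # Assumption: no abstracts at all means no evidence, so score 0.
        return 0.0
    return x_both / union


# Hypothetical counts, for illustration only.
score = jaccard_explicit(n_query=200, k_gene=50, x_both=25)
print(score)  # 25 / (200 + 50 - 25) = 25 / 225
```
]]></preformat>
<p>The score lies between 0 (the query and the gene never co-occur) and 1 (every abstract linked to one is also linked to the other).</p>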
<p>The second approach is based on the number of concepts shared between a gene and the given query profiles. We call this approach the implicit approach, since it goes one step further and tries to find hidden indirect associations between a gene and the given query. Given the UMLS concept profiles that correspond to both the query and a gene, we apply the TF-IDF (Term Frequency-Inverse Document Frequency) transformation to each term in both the query and gene profiles. This transformation is commonly used in text mining and information retrieval (see (
<xref rid="B27" ref-type="bibr">27</xref>
) for more details), and it consists of two components: the term frequency (TF) and the inverse document frequency (IDF). The TF corresponds to the number of times the concept appears in all abstracts linked to a query or a gene. Intuitively, TF rewards concepts that are frequently associated with the gene. The IDF is calculated as the total number of documents divided by the number of documents that the concept appears in. The IDF gives higher weights to concepts that are commonly associated with a given gene but rare in general, and lower weights to concepts that are associated with many genes. Intuitively, IDF helps identify concepts that are more meaningful or discriminative for a given profile. Supplementary Table S1 illustrates how we compute TF-IDF scores. To measure how similar two concept profiles are, we calculate the cosine similarity between their TF-IDF mappings according to Equation (
<xref ref-type="disp-formula" rid="M2">2</xref>
):
<disp-formula id="M2">
<label>(2)</label>
<tex-math id="M4">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}\begin{equation*} {\rm similarity}\_{\rm implicit}(q,g) = \frac{{\mathop {{\rm (tf} \times \log ({\rm idf}))}\nolimits_q \bullet \mathop {{\rm (tf} \times \log ({\rm idf}))}\nolimits_g }}{{\left\| {\mathop {{\rm (tf} \times \log ({\rm idf}))}\nolimits_q } \right\|\left\| {\mathop {{\rm (tf} \times \log ({\rm idf}))}\nolimits_g } \right\|}} \end{equation*}\end{document}</tex-math>
</disp-formula>
where tf is the term frequency and idf is the inverse document frequency (note that we use the log for scaling purposes).</p>
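<p>The computation in Equation (2) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used by the tool: the concept identifiers, the document frequencies and the collection size are hypothetical, the natural logarithm is assumed, and profiles are stored as plain dictionaries that map a concept to its weight.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def tfidf_weights(term_freq, doc_freq, n_docs):
    """Return {concept: tf * log(n_docs / df)} for concepts with df > 0."""
    return {
        concept: tf * math.log(n_docs / doc_freq[concept])
        for concept, tf in term_freq.items()
        if doc_freq.get(concept, 0) > 0
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty or all-zero profile matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical document frequencies over a collection of 100 documents.
doc_freq = {"C1": 10, "C2": 50, "C3": 100}
query_profile = tfidf_weights({"C1": 2, "C2": 1}, doc_freq, n_docs=100)
gene_profile = tfidf_weights({"C1": 1, "C3": 5}, doc_freq, n_docs=100)
similarity = cosine_similarity(query_profile, gene_profile)
print(round(similarity, 4))
```
]]></preformat>
<p>In this toy example, concept C3 appears in every document, so its weight is log(1) = 0 and it does not contribute to the similarity, which matches the intuition that very common concepts are not discriminative.</p>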
<p>
<italic>Beegle</italic>
assigns a final gene-query score by using the best rank of both approaches. We call this the combined approach. We show an example in Supplementary Table S2. Hence,
<italic>Beegle</italic>
's output for this phase is an ordered list of the genes identified as being potentially related to the given query according to the literature.</p>
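<p>One way to read the combined approach is sketched below: each gene is ranked separately by the explicit and the implicit score, and the better (smaller) of its two ranks determines its final position. This is a hedged sketch only. The gene names and scores are hypothetical, and the alphabetical tie-breaking and the handling of genes scored by only one approach are assumptions that the text does not specify.</p>
<preformat preformat-type="code"><![CDATA[
```python
def ranks_from_scores(scores):
    """Map each gene to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def combine_best_rank(explicit_scores, implicit_scores):
    """Order genes by the better of their two per-approach ranks."""
    rank_explicit = ranks_from_scores(explicit_scores)
    rank_implicit = ranks_from_scores(implicit_scores)
    genes = set(rank_explicit) | set(rank_implicit)
    worst = len(genes) + 1  # assumed rank for a gene missing from one list
    best = {
        gene: min(rank_explicit.get(gene, worst), rank_implicit.get(gene, worst))
        for gene in genes
    }
    # Assumption: ties on the best rank are broken alphabetically.
    return sorted(genes, key=lambda gene: (best[gene], gene))


# Hypothetical scores for three genes.
explicit = {"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.1}
implicit = {"GENE_A": 0.2, "GENE_B": 0.3, "GENE_C": 0.8}
combined = combine_best_rank(explicit, implicit)
print(combined)
```
]]></preformat>
<p>Here GENE_A is ranked first by the explicit score and GENE_C first by the implicit score, so both precede GENE_B, which is second under either approach.</p>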
</sec>
<sec id="SEC2-4">
<title>The discovery phase</title>
<p>In the second phase,
<italic>Beegle</italic>
integrates
<italic>Endeavour</italic>
to generate the final gene prioritization for a given query.
<italic>Endeavour</italic>
relies on three inputs: (i) a set of training genes known to be linked to the query under study, (ii) a set of data sources that are used to build the query models using the training genes and (iii) a set of candidate genes to investigate (i.e. to prioritize). Per data source,
<italic>Endeavour</italic>
ranks the candidate genes according to how similar a gene is to the corresponding model, therefore providing one ranked list for each data source. To combine the lists,
<italic>Endeavour</italic>
applies order statistics to produce a single ranking, which is the final prioritization list for the given query. For more details about
<italic>Endeavour</italic>
, we refer the reader to our previous work (
<xref rid="B11" ref-type="bibr">11</xref>
,
<xref rid="B28" ref-type="bibr">28</xref>
). Hence, in this phase,
<italic>Beegle</italic>
uses a training set (that the user selected from the known query-genes retrieved by
<italic>Beegle</italic>
in the search phase) to train the
<italic>Endeavour</italic>
models and rank a user-defined set of candidate genes. Note here that we restrict the data sources used by
<italic>Endeavour</italic>
in this phase to a predefined set that performed best in our experiments.</p>
</sec>
<sec id="SEC2-5">
<title>The web interface</title>
<p>
<italic>Beegle</italic>
is freely available online as a web interface that accepts any combination of one or many biomedical concepts (similar to
<italic>PubMed</italic>
queries such as ‘Alzheimer's disease’, ‘Diabetes and Pregnancy’ and ‘Congenital Heart Defects or Eye Diseases’). Given the input query,
<italic>Beegle</italic>
retrieves the MEDLINE abstracts annotated with any of the query terms (according to PubMed) and generates the corresponding concept profile. Then for every human gene, it computes two scores according to our two text mining approaches (as discussed above). Finally,
<italic>Beegle</italic>
returns an ordered list of the genes most likely to be linked to the given query. Since in this phase we aim only to retrieve known annotations, we restrict the output list to genes that are co-mentioned with the query at least once in the literature. The response time for this phase ranges from a few seconds to a few minutes, depending on whether the query has been processed before.</p>
<p>Next, the user can initiate the discovery phase by: (i) selecting a set of training genes (given the search output list), and (ii) defining a list of candidate genes that are of interest to prioritize. After the literature search, the interface allows users to review the evidence linked to the association by displaying relevant MEDLINE abstracts and concept profiles. This allows for the quick removal of spurious associations (false positives). Any association that might have been missed (false negative) can still be manually added to the list of genes to be used for the prioritization. Also, the user has the option to directly upload a set of training or candidate genes. Here the users have full control of fine tuning the selection process according to their expertise. Next,
<italic>Beegle</italic>
trains the genomic models and prioritizes the user-defined candidate set according to the selected training genes and the different genomic sources that are predefined (see list below). Finally,
<italic>Beegle</italic>
returns the prioritized list in a simple and user-friendly way. The response time for this phase ranges from a few minutes (ranking a few candidate genes) up to 20 min (ranking the whole genome). Figure
<xref ref-type="fig" rid="F2">2</xref>
presents three snapshots of the most important screens from the web interface. In addition, we invite the reader to visit the
<italic>Beegle</italic>
website and watch a simple tutorial on how to use the system.</p>
<fig id="F2" orientation="portrait" position="float">
<label>Figure 2.</label>
<caption>
<p>The main three screens that
<italic>Beegle</italic>
returns on the web interface. These correspond to (i) the home page, (ii) the results from the search phase and (iii) the results from the discovery phase.</p>
</caption>
<graphic xlink:href="gkv905fig2"></graphic>
</fig>
</sec>
<sec id="SEC2-6">
<title>Data sources and background corpus</title>
<p>The public web interface of
<italic>Beegle</italic>
relies on the 2013 PubMed release ‘downloaded on March 4th 2013’. Then it uses the 2013 (or the latest) version of the following data sources for the discovery phase: Gene Ontology (annotations for gene products) ‘downloaded on May 15th 2013’, Uniprot (protein sequence and functional information) ‘downloaded on June 14th 2013’, Text (MEDLINE literature) ‘downloaded on March 4th 2013’, STRING (genomic data integration) ‘downloaded on June 10th 2013’, Genetic Association Database ‘downloaded on June 14th 2013’, Rat Gene Database ‘downloaded on June 20th 2013’, gene predicted pathogenicity (
<xref rid="B29" ref-type="bibr">29</xref>
) and expression data (
<xref rid="B30" ref-type="bibr">30</xref>
).</p>
</sec>
<sec id="SEC2-7">
<title>The literature-based benchmark</title>
<p>The first benchmark we apply to mimic novel discovery is an existing validation set (
<xref rid="B10" ref-type="bibr">10</xref>
), which was previously used to compare the predictive performance of several publicly available prioritization tools. Briefly, it was manually prepared by reviewing the scientific literature to gather novel disease-gene associations. For more details about the construction of this validation set, we refer the reader to our previous work (
<xref rid="B10" ref-type="bibr">10</xref>
). This set is composed of 34 queries and 42 annotations with at least one novel gene reported in 2010. In this benchmark, the training sets used for prioritization were manually extracted from the literature and dedicated databases.</p>
</sec>
<sec id="SEC2-8">
<title>The OMIM benchmark</title>
<p>OMIM provides a list of disease-gene annotations (6377 annotations in July 2013) based on experimental evidence. The annotation list is a combination of disease-gene entries that contains both confirmed and non-confirmed entries, as well as different mapping evidence. Note that each entry has a list of gene symbols, which includes both official symbols and aliases. Furthermore, many OMIM entries refer to the same disease concept. We refine this list in five steps:
<list list-type="roman-lower">
<list-item>
<p>We remove non-confirmed entries.</p>
</list-item>
<list-item>
<p>We keep only the annotations whose evidence is based on mutations that are located within genes.</p>
</list-item>
<list-item>
<p>We keep only official gene symbols.</p>
</list-item>
<list-item>
<p>We combine disease entries that refer to the same disease concept.</p>
</list-item>
<list-item>
<p>We keep only disease entries that have at least three genes annotated.</p>
</list-item>
</list>
</p>
<p>This results in a refined list of disease-gene annotations (314 diseases and 2654 annotations) based on the OMIM database (version 2013). We call this list the OMIM-search validation set.</p>
<p>To generate a benchmark that allows a validation that mimics novel discovery, we used two versions of OMIM (2010 and 2013) as follows: for both versions, we refined the list as described above. We then compared the output of Step 1 between both versions and discarded the diseases that did not appear in both lists. We also discarded diseases for which we could not find at least one ‘novel’ gene (i.e. reported in the 2013 version and not mentioned in the 2010 one). Finally, to avoid false positives, we verified the resulting entries by manually looking into the scientific literature. Our aim was to make sure the annotated genes were assigned to their correct disease concepts, and had indeed been reported in the correct period. This process resulted in a final OMIM list of 104 diseases that had annotations in both the 2010 and 2013 versions, and for which we had at least one novel gene. This corresponds to a total of 959 annotations reported in 2010, and a total of 277 annotations newly reported after 2010. We call these lists the OMIM-discovery validation set. The value of this setup is that, for a version of
<italic>Beegle</italic>
limited to using information obtained prior to 2010, the discoveries made in the 2010–2013 period will serve as a prospective validation of the tool.</p>
<p>Since we consider both the OMIM-search and the OMIM-discovery validation sets a secondary contribution of this work, we release the full lists in Supplementary Data 3–6. Hence, other researchers can use them for evaluating different gene prioritization methodologies. Table
<xref ref-type="table" rid="tbl1">1</xref>
provides a summary of the benchmark data sets.</p>
<table-wrap id="tbl1" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<title>A summary of the OMIM benchmarks</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">No. of diseases</th>
<th align="left" rowspan="1" colspan="1">No. of genes</th>
<th align="left" rowspan="1" colspan="1">No. of disease-gene pairs</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">OMIM-search set</td>
<td align="left" rowspan="1" colspan="1">314</td>
<td align="left" rowspan="1" colspan="1">2055</td>
<td align="left" rowspan="1" colspan="1">2654</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">OMIM-discovery set 2010</td>
<td align="left" rowspan="1" colspan="1">104</td>
<td align="left" rowspan="1" colspan="1">859</td>
<td align="left" rowspan="1" colspan="1">959</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">OMIM-discovery set 2013</td>
<td align="left" rowspan="1" colspan="1">104</td>
<td align="left" rowspan="1" colspan="1">1107</td>
<td align="left" rowspan="1" colspan="1">1236</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">OMIM-discovery set 2013–2010</td>
<td align="left" rowspan="1" colspan="1">104</td>
<td align="left" rowspan="1" colspan="1">265</td>
<td align="left" rowspan="1" colspan="1">277</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TFN001">
<p>A summary of the different OMIM benchmarks that we used in our evaluation mentioning the counts of covered diseases, genes and disease-gene pairs. We used two versions (2010 and 2013).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="SEC2-9">
<title>The evaluation setup</title>
<p>The goal of the empirical evaluation is to (i) assess the quality of the text mining approaches in terms of retrieving known gene associations, and (ii) evaluate the ability of the retrieved genes to serve as training sets for building the <italic>Endeavour</italic> models and proposing novel hypotheses.</p>
<p>To address the first question, we employed the following methodology. First, we used the OMIM-search validation set to measure the percentage of recall in the top genes returned by
<italic>Beegle</italic>
in the search phase. Here, recall corresponds to the fraction of the known (true positive) genes for a query that are retrieved within a given rank threshold. For each query, we calculated the recall in the top 10, 25, 50 and 100 ranked genes. We then calculated the average recall over all disease queries. Note that our lists of ranked genes are based on the 2013 MEDLINE release. Second, we compared
<italic>Beegle</italic>
to
<italic>MeSHOP</italic>
, which is a similar text mining tool.
<italic>MeSHOP</italic>
uses concept profile similarity to rank a list of human genes given an input MeSH term (
<xref rid="B22" ref-type="bibr">22</xref>
). We chose
<italic>MeSHOP</italic>
since its current results are also based on the 2013 PubMed release. We compared the two systems using a subset of the OMIM benchmark, such that each query returned a reasonable number of genes (ranging between 3 and 30). Since
<italic>MeSHOP</italic>
is restricted to using MeSH terms as queries, we further limited our subset to diseases where we could find equivalent MeSH terms. This resulted in 18 disease queries, which we provide in Supplementary Table S3. We call this the OMIM-comparison set. For every query in this set, we used each system to generate a gene ranking. Then, for each system and query, we computed the recall in the top 10, 25, 50 and 100 ranked genes (according to the corresponding OMIM genes reported in 2013). Finally, we computed both
<italic>Beegle</italic>
and
<italic>MeSHOP's</italic>
average recall over all queries for a specific rank threshold.</p>
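<p>The recall measure used in this comparison can be sketched as follows: for each query, count how many of the known genes appear among the top <italic>k</italic> ranked genes, divide by the number of known genes, and average over queries. This is a minimal sketch; the disease names, gene names and rankings are hypothetical and do not come from the OMIM benchmark.</p>
<preformat preformat-type="code"><![CDATA[
```python
def recall_at_k(ranked_genes, known_genes, k):
    """Fraction of the known genes found among the top-k ranked genes."""
    known = set(known_genes)
    if not known:
        raise ValueError("recall is undefined without known genes")
    hits = known.intersection(ranked_genes[:k])
    return len(hits) / len(known)


def mean_recall_at_k(rankings, truth, k):
    """Average recall at rank k over all queries in the truth set."""
    values = [recall_at_k(rankings[query], truth[query], k) for query in truth]
    return sum(values) / len(values)


# Hypothetical rankings and known genes for two queries.
rankings = {
    "disease_1": ["G1", "G2", "G3", "G4"],
    "disease_2": ["G5", "G6", "G7", "G8"],
}
truth = {
    "disease_1": {"G1", "G4", "G9"},
    "disease_2": {"G6"},
}
for k in (2, 4):
    print(k, mean_recall_at_k(rankings, truth, k))
```
]]></preformat>
<p>In the toy data, G9 is never returned for disease_1, so the recall for that query cannot exceed 2/3 at any threshold; averaging per query gives every disease the same weight regardless of how many genes it has.</p>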
<p>For the second evaluation, we used the literature-based validation set in one experiment, and then we used the OMIM-discovery validation set in a second one. For the literature-based set, we conducted the experiment as follows. We trained and compared three
<italic>Endeavour</italic>
models: one using the manually selected set of input genes, one using the top-10 genes retrieved by
<italic>Beegle</italic>
, and one using the top-
<italic>n</italic>
genes, where <italic>n</italic> is the number of genes in the query's manual training set. We chose the whole genome as the candidate set. We then compared the recall in the top 5%, 10% and 30% of the prioritized genes. We used these thresholds because they were previously used to compare
<italic>Endeavour</italic>
and other prioritization tools (
<xref rid="B10" ref-type="bibr">10</xref>
). On average, the thresholds correspond to the top 1000, 2000 and 6000 ranked genes. For the OMIM-discovery set, we ran
<italic>Endeavour</italic>
once with the set of OMIM genes reported until 2010, once with the top-10 genes retrieved by
<italic>Beegle</italic>
, and once with the top-
<italic>n</italic>
genes (where
<italic>n</italic>
is the number of OMIM genes for a given query). Again, we used the whole genome as our candidate set. Then we compared the recall and the average AUC (Area Under the ROC Curve) obtained with each training set. For the ROC curves, we defined a number of rank thresholds (from 10 to 22000), then for each query we measured the TPR (True Positive Rate) and the FPR (False Positive Rate) at each threshold. Afterwards, we averaged the TPR and FPR values over all queries at each threshold, and we used these averages to plot the ROC curves. Note that in both experiments, the genes retrieved by
<italic>Beegle</italic>
and the final prioritizations generated by
<italic>Endeavour</italic>
were based on literature and genomic data sources before 2010. The two validation sets are based on disease-gene associations reported from 2010 onwards. Thus our prioritizations were not contaminated by novel information.</p>
</sec>
</sec>
<sec sec-type="results" id="SEC3">
<title>RESULTS</title>
<sec id="SEC3-1">
<title>Evaluating the search phase</title>
<p>We present the results of the OMIM-search set in Table
<xref ref-type="table" rid="tbl2">2</xref>
. We observe that using co-occurrence alone results in an average recall of 54% in the top 10 versus 73% in the top 100. Similarly, using concept profile similarity alone results in an average recall of 48% in the top 10 versus 68% in the top 100. Using the best rank achieves the highest recall: 56% in the top 10 and 78% in the top 100 ranked genes. The same table also reports the number of genes that are confirmed by both approaches (separately and on average) in the top 10, 25, 50 and 100 ranked genes. We observe a 41% intersection in the top 10 ranked genes (4.1 genes on average) versus 21.7% in the top 100 (21.7 genes on average). In addition, we report the average recall at different
<italic>P</italic>
-value levels using co-occurrence in Table
<xref ref-type="table" rid="tbl3">3</xref>
. We observe similar results to measuring the recall at different rank levels as reported in Table
<xref ref-type="table" rid="tbl2">2</xref>
.</p>
<table-wrap id="tbl2" orientation="portrait" position="float">
<label>Table 2.</label>
<caption>
<title>The results of the OMIM-search set</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">Top 10</th>
<th align="left" rowspan="1" colspan="1">Top 25</th>
<th align="left" rowspan="1" colspan="1">Top 50</th>
<th align="left" rowspan="1" colspan="1">Top 100</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Co-occurrence</td>
<td align="left" rowspan="1" colspan="1">54%</td>
<td align="left" rowspan="1" colspan="1">64%</td>
<td align="left" rowspan="1" colspan="1">69%</td>
<td align="left" rowspan="1" colspan="1">73%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Concept profile similarity</td>
<td align="left" rowspan="1" colspan="1">48%</td>
<td align="left" rowspan="1" colspan="1">57%</td>
<td align="left" rowspan="1" colspan="1">62%</td>
<td align="left" rowspan="1" colspan="1">68%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Best rank</td>
<td align="left" rowspan="1" colspan="1">
<bold>56%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>67%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>74%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>78%</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>The average number of genes confirmed by both approaches</italic>
</td>
<td align="left" rowspan="1" colspan="1">
<italic>4.1</italic>
</td>
<td align="left" rowspan="1" colspan="1">
<italic>8.1</italic>
</td>
<td align="left" rowspan="1" colspan="1">
<italic>13.3</italic>
</td>
<td align="left" rowspan="1" colspan="1">
<italic>21.7</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TFN002">
<p>A comparison of the average recall in the top 10, 25, 50 and 100 returned genes using co-occurrence, concept profile similarity and best rank. The best rank returns the best recall results. We also present the average number of genes confirmed by both co-occurrence and concept profile similarity. The proportion of shared genes is higher in the top 10 than in the top 100.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="tbl3" orientation="portrait" position="float">
<label>Table 3.</label>
<caption>
<title>The results of the OMIM-search set at different
<italic>P</italic>
-value levels using co-occurrence</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">The
<italic>P</italic>
-value level</th>
<th align="left" rowspan="1" colspan="1">0.0000001</th>
<th align="left" rowspan="1" colspan="1">0.001</th>
<th align="left" rowspan="1" colspan="1">0.005</th>
<th align="left" rowspan="1" colspan="1">0.01</th>
<th align="left" rowspan="1" colspan="1">0.1</th>
<th align="left" rowspan="1" colspan="1">0.5</th>
<th align="left" rowspan="1" colspan="1">1</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">The average rank level</td>
<td align="left" rowspan="1" colspan="1">24</td>
<td align="left" rowspan="1" colspan="1">44</td>
<td align="left" rowspan="1" colspan="1">55</td>
<td align="left" rowspan="1" colspan="1">60</td>
<td align="left" rowspan="1" colspan="1">104</td>
<td align="left" rowspan="1" colspan="1">158</td>
<td align="left" rowspan="1" colspan="1">16493</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">The average recall</td>
<td align="left" rowspan="1" colspan="1">0.6316</td>
<td align="left" rowspan="1" colspan="1">0.6944</td>
<td align="left" rowspan="1" colspan="1">0.7091</td>
<td align="left" rowspan="1" colspan="1">0.7139</td>
<td align="left" rowspan="1" colspan="1">0.7451</td>
<td align="left" rowspan="1" colspan="1">0.7678</td>
<td align="left" rowspan="1" colspan="1">1.0000</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TFN003">
<p>The average recall at different
<italic>P</italic>
-value levels using co-occurrence. We observe similar results to measuring the recall at different rank levels.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>Table
<xref ref-type="table" rid="tbl4">4</xref>
compares
<italic>Beegle</italic>
and
<italic>MeSHOP</italic>
on the OMIM-comparison set. We observe that
<italic>Beegle</italic>
achieves an average recall of 69% in the top 10 and 84% in the top 100. By comparison,
<italic>MeSHOP</italic>
achieves an average recall of 51% in the top 10 and 63% in the top 100. Supplementary Data 7 and 8 contain the full lists of genes returned by each tool.</p>
<table-wrap id="tbl4" orientation="portrait" position="float">
<label>Table 4.</label>
<caption>
<title>The results on the OMIM-comparison set</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1"></th>
<th align="left" rowspan="1" colspan="1">Top 10</th>
<th align="left" rowspan="1" colspan="1">Top 25</th>
<th align="left" rowspan="1" colspan="1">Top 50</th>
<th align="left" rowspan="1" colspan="1">Top 100</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Beegle</italic>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>69%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>80%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>83%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>84%</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>MeSHOP</italic>
</td>
<td align="left" rowspan="1" colspan="1">51%</td>
<td align="left" rowspan="1" colspan="1">60%</td>
<td align="left" rowspan="1" colspan="1">62%</td>
<td align="left" rowspan="1" colspan="1">63%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TFN004">
<p>A comparison of the average recall in the top 10, 25, 50 and 100 returned genes using
<italic>Beegle</italic>
and
<italic>MeSHOP</italic>
.
<italic>Beegle</italic>
obtains better recall, improving on
<italic>MeSHOP</italic>
by 18 percentage points in the top 10 returned genes.</p>
</fn>
</table-wrap-foot>
</table-wrap>
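<p>The average recall in the top-k returned genes, as reported in the tables above, can be illustrated with a minimal sketch. It assumes that recall for one query is the fraction of its known genes found among the first k returned genes, and that the reported value is the mean over queries. The function names and the toy gene symbols are hypothetical and are not taken from the evaluation code.</p>
<preformat>
```python
from statistics import mean


def recall_at_k(ranked_genes, known_genes, k):
    """Fraction of the known genes found among the first k ranked genes."""
    known = set(known_genes)
    if not known:
        raise ValueError("known_genes must not be empty")
    top_k = set(ranked_genes[:k])
    return len(known.intersection(top_k)) / len(known)


def average_recall(queries, k):
    """Mean recall at k over (ranked_genes, known_genes) pairs, one per query."""
    return mean(recall_at_k(ranked, known, k) for ranked, known in queries)


# Hypothetical toy data: two queries with made-up gene symbols.
queries = [
    (["G1", "G2", "G3", "G4"], ["G2", "G9"]),
    (["G5", "G6", "G7", "G8"], ["G5"]),
]
print(average_recall(queries, 2))
```
</preformat>
<p>In this toy example the first query recovers one of its two known genes in the top 2 and the second query recovers its single known gene, so the average is the mean of those two per-query recalls.</p>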
</sec>
<sec id="SEC3-2">
<title>Evaluating the discovery phase</title>
<p>Table
<xref ref-type="table" rid="tbl5">5</xref>
presents the results of the literature-based set. Using the top-10 genes retrieved by
<italic>Beegle</italic>
as an input set to train the
<italic>Endeavour</italic>
models results in an average recall of 41.2%, 48.5% and 77.5% in the top 5%, 10% and 30% prioritized genes, and using the top-
<italic>n</italic>
genes similarly results in an average recall of 33.3%, 46.6% and 73.0%. In comparison, using manually extracted input sets results in recalls of 28.6%, 38.1% and 71.4%. Hence
<italic>Beegle</italic>
's automatic input set can improve the recall by up to 12.6, 10.4 and 6.1 percentage points at the top 5%, 10% and 30% of prioritized genes, respectively.</p>
<table-wrap id="tbl5" orientation="portrait" position="float">
<label>Table 5.</label>
<caption>
<title>The results on the literature-based set</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Input set</th>
<th align="left" rowspan="1" colspan="1">Recall in top 5%</th>
<th align="left" rowspan="1" colspan="1">Recall in top 10%</th>
<th align="left" rowspan="1" colspan="1">Recall in top 30%</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Manually-extracted</td>
<td align="left" rowspan="1" colspan="1">28.6%</td>
<td align="left" rowspan="1" colspan="1">38.1%</td>
<td align="left" rowspan="1" colspan="1">71.4%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Beegle's</italic>
top-10</td>
<td align="left" rowspan="1" colspan="1">
<bold>41.2%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>48.5%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>77.5%</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Beegle's top-n</italic>
</td>
<td align="left" rowspan="1" colspan="1">33.3%</td>
<td align="left" rowspan="1" colspan="1">46.6%</td>
<td align="left" rowspan="1" colspan="1">73.0%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TFN005">
<p>A comparison of the average recall results for the final gene prioritizations (using the literature-based set) in the top 5%, 10% and 30% prioritized genes using the manually-extracted,
<italic>Beegle</italic>
-extracted top-10, and 
<italic>Beegle</italic>
-extracted top-
<italic>n</italic>
input sets. The automatic input set from
<italic>Beegle</italic>
improves the recall results (top-10 more so than top-<italic>n</italic>).</p>
</fn>
</table-wrap-foot>
</table-wrap>
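<p>The improvement figures quoted for the literature-based set follow directly from Table 5 as differences in percentage points. The short sketch below reproduces that arithmetic; the recall values are copied from Table 5, while the dictionary keys and the function name are hypothetical labels chosen for illustration.</p>
<preformat>
```python
# Average recall (%) in the top 5%, 10% and 30%, copied from Table 5.
table5 = {
    "manual": (28.6, 38.1, 71.4),
    "beegle_top10": (41.2, 48.5, 77.5),
    "beegle_topn": (33.3, 46.6, 73.0),
}


def gain_over_manual(input_set):
    """Percentage-point difference from the manually extracted baseline."""
    baseline = table5["manual"]
    return tuple(round(v - b, 1) for v, b in zip(table5[input_set], baseline))


print(gain_over_manual("beegle_top10"))  # (12.6, 10.4, 6.1)
print(gain_over_manual("beegle_topn"))   # (4.7, 8.5, 1.6)
```
</preformat>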
<p>Table
<xref ref-type="table" rid="tbl6">6</xref>
presents the recall results for the OMIM-discovery set. The results are comparable when using
<italic>Beegle</italic>
's top retrieved genes and OMIM-reported genes as input sets. This corresponds to an average recall of 34.3%, 43.3% and 65.7% in the top 5%, 10% and 30% prioritized genes. Figure
<xref ref-type="fig" rid="F3">3</xref>
shows the ROC curves comparing all input sets. On average, the AUC is 0.73.</p>
<fig id="F3" orientation="portrait" position="float">
<label>Figure 3.</label>
<caption>
<p>The ROC curves for the final gene prioritizations on the OMIM-discovery set. The green curve corresponds to OMIM 2010 derived training sets, the magenta curve corresponds to
<italic>Beegle's</italic>
top-10 training sets, and the blue curve corresponds to
<italic>Beegle's</italic>
top-
<italic>n</italic>
training sets. The results are comparable.</p>
</caption>
<graphic xlink:href="gkv905fig3"></graphic>
</fig>
<table-wrap id="tbl6" orientation="portrait" position="float">
<label>Table 6.</label>
<caption>
<title>The results on the OMIM-discovery set</title>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Input set</th>
<th align="left" rowspan="1" colspan="1">Recall in top 5%</th>
<th align="left" rowspan="1" colspan="1">Recall in top 10%</th>
<th align="left" rowspan="1" colspan="1">Recall in top 30%</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">OMIM-reported</td>
<td align="left" rowspan="1" colspan="1">35%</td>
<td align="left" rowspan="1" colspan="1">45%</td>
<td align="left" rowspan="1" colspan="1">67%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Beegle's</italic>
top-10</td>
<td align="left" rowspan="1" colspan="1">
<bold>36%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>44%</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>66%</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<italic>Beegle's</italic>
top-
<italic>n</italic>
</td>
<td align="left" rowspan="1" colspan="1">32%</td>
<td align="left" rowspan="1" colspan="1">41%</td>
<td align="left" rowspan="1" colspan="1">64%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TFN006">
<p>A comparison of the average recall results for the final gene prioritizations on the OMIM-discovery set for the top 5%, 10% and 30% of the prioritized genes using the OMIM-reported,
<italic>Beegle</italic>
-extracted top-10, and
<italic>Beegle</italic>
-extracted top-
<italic>n</italic>
input sets. The automatic input set from
<italic>Beegle</italic>
shows comparable results to the one extracted from OMIM (with a slight improvement in the top 5% using top-10).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="discussion" id="SEC4">
<title>DISCUSSION</title>
<p>Our results show the potential of
<italic>Beegle</italic>
to annotate genes to diseases starting from the literature. On the one hand, the results from the search phase show that relying on either an explicit signal (that comes directly from the text) or an implicit one (that is inferred from the fraction of shared concepts) to annotate genes to diseases tends to work well. They also show that combining the explicit and implicit signals results in stronger retrieval, as demonstrated by the increased number of experimentally validated genes that appear among the top-ranked genes. In addition, we found that
<italic>Beegle</italic>
improves the average recall by 18 and 21 percentage points at the top 10 and top 100 returned genes, respectively, when compared to
<italic>MeSHOP</italic>
. On the other hand, the results from the discovery phase demonstrate that
<italic>Beegle</italic>
can automatically generate an interesting training set to build models for predicting novel genes. They also show that using the top-10 returned genes as a training set slightly improves the performance relative to using the top-
<italic>n</italic>
genes. While we believe that expert input remains invaluable, we were surprised to observe that the results of automatic retrieval are at least as good as those for manually curated gene sets. We do however believe that additional review of potential training genes identified by
<italic>Beegle</italic>
(and the addition of any important gene missed during the retrieval phase) will further enhance the performance of the approach significantly (although this is difficult to quantify in a benchmark).</p>
<p>
<italic>Beegle</italic>
allows experts to find existing gene associations and predict new ones in an easy and straightforward manner. First, through the free-text query support, users can try any combination of biomedical concepts of interest, for which they can explore gene associations. Second, in the search phase, for every gene returned,
<italic>Beegle</italic>
presents two additional outputs: (i) at least one piece of literature in which the gene and the query are reported together, and (ii) a word cloud that shows the concepts that best describe the gene in comparison to those of the query. These extra outputs provide an additional level of insight through which users can further assess the query-gene association and decide whether to add a given gene to the training set. Finally, in the discovery phase, in addition to the global rank,
<italic>Beegle</italic>
presents a detailed rank of the candidate genes according to the different genomic sources employed. Hence, users gain additional insights that help them assess the viability of a candidate gene and decide whether to take it to the next step (e.g. wet-lab experiments).</p>
<p>Methodologically,
<italic>Beegle</italic>
mines the abstracts on MEDLINE to retrieve a ranked list of known genes using two text mining techniques. The first measures co-occurrence and the second measures concept profile similarity between genes and biomedical concepts. This is novel compared to previous research that has focused on using one of the two techniques in isolation to generate biomedical associations (
<xref rid="B17" ref-type="bibr">17</xref>
–
<xref rid="B22" ref-type="bibr">22</xref>
).
<italic>CoPub</italic>
(
<xref rid="B21" ref-type="bibr">21</xref>
) and
<italic>MeSHOP</italic>
(
<xref rid="B22" ref-type="bibr">22</xref>
) are two examples of such methodology.
<italic>Beegle</italic>
separates the search for known genes from the search for possible candidate genes. This is different from existing work that merges known and unknown gene associations in the same ranking (
<xref rid="B21" ref-type="bibr">21</xref>
–
<xref rid="B24" ref-type="bibr">24</xref>
).
<italic>BITOLA</italic>
(
<xref rid="B23" ref-type="bibr">23</xref>
) and
<italic>Genie</italic>
(
<xref rid="B24" ref-type="bibr">24</xref>
) are two examples of such methodology. Furthermore,
<italic>Beegle</italic>
supports the search using any free text query, which among existing systems is only possible in
<italic>Genie</italic>
. The rest of the existing tools are limited to specific vocabularies (e.g. MeSH terms). Supplementary Table S4 conceptually compares
<italic>Beegle</italic>
and four of the most closely related systems:
<italic>CoPub, MeSHOP, BITOLA</italic>
and
<italic>Genie</italic>
.</p>
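<p>The combination of the two rankings can be illustrated with a minimal sketch. It assumes that "best rank" means assigning each gene the better of its co-occurrence rank and its concept profile similarity rank, and then sorting by that value. This reading, the function name, the alphabetical tie-break and the toy gene symbols are assumptions made for illustration and may differ from the actual implementation.</p>
<preformat>
```python
def best_rank_merge(cooccurrence_ranking, profile_ranking):
    """Order genes by the better (smaller) of their two 1-based ranks."""
    best = {}
    for ranking in (cooccurrence_ranking, profile_ranking):
        for position, gene in enumerate(ranking, start=1):
            best[gene] = min(position, best.get(gene, position))
    # Ties are broken alphabetically here; this tie-break is an assumption.
    return sorted(best, key=lambda gene: (best[gene], gene))


# Hypothetical rankings with made-up gene symbols.
cooccurrence = ["G1", "G2", "G3"]
profile = ["G3", "G4", "G1"]
print(best_rank_merge(cooccurrence, profile))  # ['G1', 'G3', 'G2', 'G4']
```
</preformat>
<p>A gene that is ranked highly by only one of the two techniques (such as G4 above) still appears near the top of the merged list, which is one way a combined ranking could recover genes that either signal alone would miss.</p>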
<p>Nevertheless,
<italic>Beegle</italic>
is limited in the following ways. On the one hand, the user of
<italic>Beegle</italic>
is not allowed to choose the genomic sources that are used in the discovery phase and has to follow our preselection of sources (which proved to work best in our experiments). Also, the user is expected to manually add training genes only in the form of gene symbols (that have a corresponding Entrez ID). On the other hand, the response time of
<italic>Endeavour</italic>
is relatively slow, and it can take up to 20 min to prioritize the whole genome. This is not optimal when our users expect an instantaneous response, based on their experience with other search engines such as Google.</p>
<p>For future work, we plan to enhance
<italic>Beegle</italic>
as follows. One way would be by improving the identification of known genes, which would be possible through applying enhanced text mining approaches. For instance, one approach could be to use a refined vocabulary set to generate the concept profiles for our queries and genes. This could be achieved by selecting high-quality concepts and discarding confusing ones. Another way would be to develop even better validation sets to measure the quality of the gene prioritizations. In this work, most of our control diseases are linked to just one future gene. We thus believe that a more extensive set with better gene coverage will give us a better insight into the performance of our tool. We also plan to integrate
<italic>Beegle</italic>
with variant prioritization tools (that are complementary to gene prioritization tools), such as
<italic>eXtasy</italic>
(
<xref rid="B31" ref-type="bibr">31</xref>
). We also plan to enhance the web interface by (i) adding user account support (for managing personal queries, gene lists, etc.) and (ii) improving the response time by using a compact version of our data sets (e.g. compacting the vocabulary).</p>
</sec>
<sec id="SEC5">
<title>AVAILABILITY</title>
<p>
<italic>Beegle</italic>
 is publicly available at:
<ext-link ext-link-type="uri" xlink:href="http://beegle.esat.kuleuven.be/">http://beegle.esat.kuleuven.be/</ext-link>
.</p>
</sec>
<sec sec-type="supplementary-material" id="SEC6">
<title>SUPPLEMENTARY DATA</title>
<p>
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/lookup/suppl/doi:10.1093/nar/gkv905/-/DC1">Supplementary Data</ext-link>
are available at NAR Online.</p>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>SUPPLEMENTARY DATA</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_44_2_e18__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="x-zip-compressed" xlink:href="supp_gkv905_nar-01815-met-n-2015-File011.zip"></media>
</supplementary-material>
</sec>
</body>
<back>
<fn-group>
<fn id="FN1" fn-type="present-address">
<p>Present address: Sarah ElShal, Department of Electrical Engineering (ESAT) STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven 3001, Belgium.</p>
</fn>
</fn-group>
<sec id="SEC7">
<title>FUNDING</title>
<p>Research Council KU Leuven [CoE PFV/10/016 SymBioSys to Y.M. and OT/11/051 to J.D.]; Innovation by Science and Technology [to Y.M.]; Industrial Research fund [to Y.M.]; Hercules Stichting [to Y.M.]; iMinds Medical Information Technologies [SBO 2015 to Y.M.]; EU FP7 Marie Curie Career Integration Grant [294068 to J.D.]; FWO-Vlaanderen [G.0356.12 to J.D.]; IMEC mandaat [Ph.D. mandaat to A.A.]. Funding for open access charge: Research Council KU Leuven.</p>
<p>
<italic>Conflict of interest statement</italic>
. None declared.</p>
</sec>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Perez-Iratxeta</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Andrade</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
<article-title>Association of genes to genetically inherited diseases using data mining</article-title>
<source>Nat. Genet.</source>
<year>2002</year>
<volume>31</volume>
<fpage>316</fpage>
<lpage>319</lpage>
<pub-id pub-id-type="pmid">12006977</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moody</surname>
<given-names>S.E.</given-names>
</name>
<name>
<surname>Boehm</surname>
<given-names>J.S.</given-names>
</name>
<name>
<surname>Barbie</surname>
<given-names>D.A.</given-names>
</name>
<name>
<surname>Hahn</surname>
<given-names>W.C.</given-names>
</name>
</person-group>
<article-title>Functional genomics and cancer drug target discovery</article-title>
<source>Curr. Opin. Mol. Ther.</source>
<year>2010</year>
<volume>12</volume>
<fpage>284</fpage>
<lpage>293</lpage>
<pub-id pub-id-type="pmid">20521217</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.Y.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>T.A.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Meta-analysis of genome-wide association studies of adult height in East Asians identifies 17 novel loci</article-title>
<source>Hum. Mol. Genet.</source>
<year>2014</year>
<volume>24</volume>
<fpage>1791</fpage>
</element-citation>
</ref>
<ref id="B4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qiu</surname>
<given-names>Y.H.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>F.Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.J.</given-names>
</name>
<name>
<surname>Lei</surname>
<given-names>S.F.</given-names>
</name>
</person-group>
<article-title>Identification of novel risk genes associated with type 1 diabetes mellitus using a genome-wide gene-based association analysis</article-title>
<source>J. Diabetes Investig.</source>
<year>2014</year>
<volume>5</volume>
<fpage>649</fpage>
<lpage>656</lpage>
</element-citation>
</ref>
<ref id="B5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Penttilä</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jokela</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bouquin</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Saukkonen</surname>
<given-names>A.M.</given-names>
</name>
<name>
<surname>Toivanen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Udd</surname>
<given-names>B.</given-names>
</name>
</person-group>
<article-title>Late-Onset spinal motor neuronopathy is caused by mutation in CHCHD10</article-title>
<source>Ann. Neurol.</source>
<year>2014</year>
<volume>77</volume>
<fpage>163</fpage>
<lpage>172</lpage>
<pub-id pub-id-type="pmid">25428574</pub-id>
</element-citation>
</ref>
<ref id="B6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tranchevent</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>Capdevila</surname>
<given-names>F.B.</given-names>
</name>
<name>
<surname>Nitsch</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>De Causmaecker</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<article-title>A guide to web tools to prioritize candidate genes</article-title>
<source>Brief. Bioinform.</source>
<year>2011</year>
<volume>11</volume>
<fpage>1</fpage>
<lpage>11</lpage>
</element-citation>
</ref>
<ref id="B7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Doncheva</surname>
<given-names>N.T.</given-names>
</name>
<name>
<surname>Kacprowski</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Albrecht</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Recent approaches to the prioritization of candidate disease genes</article-title>
<source>Wiley Interdiscip. Rev. Syst. Biol. Med.</source>
<year>2012</year>
<volume>4</volume>
<fpage>429</fpage>
<lpage>442</lpage>
<pub-id pub-id-type="pmid">22689539</pub-id>
</element-citation>
</ref>
<ref id="B8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Piro</surname>
<given-names>R.M.</given-names>
</name>
<name>
<surname>Di Cunto</surname>
<given-names>F.</given-names>
</name>
</person-group>
<article-title>Computational approaches to disease-gene prediction: rationale, classification and successes</article-title>
<source>FEBS J.</source>
<year>2012</year>
<volume>279</volume>
<fpage>678</fpage>
<lpage>696</lpage>
<pub-id pub-id-type="pmid">22221742</pub-id>
</element-citation>
</ref>
<ref id="B9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Moreau</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tranchevent</surname>
<given-names>L.C.</given-names>
</name>
</person-group>
<article-title>Computational tools for prioritizing candidate genes: boosting disease-gene discovery</article-title>
<source>Nat. Rev. Genet.</source>
<year>2012</year>
<volume>13</volume>
<fpage>523</fpage>
<lpage>536</lpage>
<pub-id pub-id-type="pmid">22751426</pub-id>
</element-citation>
</ref>
<ref id="B10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Börnigen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tranchevent</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>Bonachela-Capdevila</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Devriendt</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>De Causmaecker</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<article-title>An unbiased evaluation of gene prioritization tools</article-title>
<source>Bioinformatics.</source>
<year>2012</year>
<volume>28</volume>
<fpage>3081</fpage>
<lpage>3088</lpage>
<pub-id pub-id-type="pmid">23047555</pub-id>
</element-citation>
</ref>
<ref id="B11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tranchevent</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>Barriot</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Van Vooren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Van Loo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Coessens</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Aerts</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<article-title>ENDEAVOUR update: a web resource for gene prioritization in multiple species</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>W377</fpage>
<lpage>W384</lpage>
<pub-id pub-id-type="pmid">18508807</pub-id>
</element-citation>
</ref>
<ref id="B12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Seelow</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Schwarz</surname>
<given-names>J.M.</given-names>
</name>
<name>
<surname>Schuelke</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>GeneDistiller–distilling candidate genes from linkage intervals</article-title>
<source>PLoS One</source>
<year>2008</year>
<volume>3</volume>
<fpage>e3874</fpage>
<pub-id pub-id-type="pmid">19057649</pub-id>
</element-citation>
</ref>
<ref id="B13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bardes</surname>
<given-names>E.E.</given-names>
</name>
<name>
<surname>Aronow</surname>
<given-names>B.J.</given-names>
</name>
<name>
<surname>Jegga</surname>
<given-names>A.G.</given-names>
</name>
</person-group>
<article-title>ToppGene Suite for gene list enrichment analysis and candidate gene prioritization</article-title>
<source>Nucleic Acids Res.</source>
<year>2009</year>
<volume>37</volume>
<fpage>W305</fpage>
</element-citation>
</ref>
<ref id="B14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Amberger</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bocchini</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Hamosh</surname>
<given-names>A.</given-names>
</name>
</person-group>
<article-title>A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®)</article-title>
<source>Hum. Mutat.</source>
<year>2011</year>
<volume>32</volume>
<fpage>564</fpage>
<lpage>567</lpage>
<pub-id pub-id-type="pmid">21472891</pub-id>
</element-citation>
</ref>
<ref id="B15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Becker</surname>
<given-names>K.G.</given-names>
</name>
<name>
<surname>Barnes</surname>
<given-names>K.C.</given-names>
</name>
<name>
<surname>Bright</surname>
<given-names>T.J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.A.</given-names>
</name>
</person-group>
<article-title>The Genetic Association Database</article-title>
<source>Nat. Genet.</source>
<year>2004</year>
<volume>36</volume>
<fpage>431</fpage>
<lpage>432</lpage>
<pub-id pub-id-type="pmid">15118671</pub-id>
</element-citation>
</ref>
<ref id="B16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jensen</surname>
<given-names>J.L.</given-names>
</name>
<name>
<surname>Saric</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P.</given-names>
</name>
</person-group>
<article-title>Literature mining for the biologist: from information retrieval to biological discovery</article-title>
<source>Nat. Rev. Genet.</source>
<year>2006</year>
<volume>7</volume>
<fpage>119</fpage>
<lpage>129</lpage>
<pub-id pub-id-type="pmid">16418747</pub-id>
</element-citation>
</ref>
<ref id="B17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jelier</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Jenster</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dorssers</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>Wouters</surname>
<given-names>B.J.</given-names>
</name>
<name>
<surname>Hendriksen</surname>
<given-names>P.J.</given-names>
</name>
<name>
<surname>Mons</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Delwel</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Kors</surname>
<given-names>J.A.</given-names>
</name>
</person-group>
<article-title>Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation</article-title>
<source>BMC Bioinformatics.</source>
<year>2007</year>
<volume>18</volume>
<fpage>8</fpage>
<lpage>14</lpage>
</element-citation>
</ref>
<ref id="B18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jelier</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Schuemie</surname>
<given-names>M.J.</given-names>
</name>
<name>
<surname>Roes</surname>
<given-names>P.J.</given-names>
</name>
<name>
<surname>van Mulligen</surname>
<given-names>E.M.</given-names>
</name>
<name>
<surname>Kors</surname>
<given-names>J.A.</given-names>
</name>
</person-group>
<article-title>Literature-based concept profiles for gene annotation: the issue of weighting</article-title>
<source>Int. J. Med. Inform.</source>
<year>2008</year>
<volume>77</volume>
<fpage>354</fpage>
<lpage>362</lpage>
<pub-id pub-id-type="pmid">17827057</pub-id>
</element-citation>
</ref>
<ref id="B19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jelier</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Schuemie</surname>
<given-names>M.J.</given-names>
</name>
<name>
<surname>Veldhoven</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Dorssers</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>Jenster</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kors</surname>
<given-names>J.A.</given-names>
</name>
</person-group>
<article-title>Anni 2.0: a multipurpose text-mining tool for the life sciences</article-title>
<source>Genome Biol.</source>
<year>2008</year>
<volume>9</volume>
<fpage>R96</fpage>
<pub-id pub-id-type="pmid">18549479</pub-id>
</element-citation>
</ref>
<ref id="B20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glenisson</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Coessens</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Van Vooren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mathys</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B.</given-names>
</name>
</person-group>
<article-title>TXTGate: profiling gene groups with text-based information</article-title>
<source>Genome Biol.</source>
<year>2004</year>
<volume>5</volume>
<fpage>R43</fpage>
<pub-id pub-id-type="pmid">15186494</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fleuren</surname>
<given-names>W.W.</given-names>
</name>
<name>
<surname>Verhoeven</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Frijters</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Heupers</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Polman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>van Schaik</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>de Vlieg</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Alkema</surname>
<given-names>W.</given-names>
</name>
</person-group>
<article-title>CoPub update: CoPub 5.0 a text mining system to answer biological questions</article-title>
<source>Nucleic Acids Res.</source>
<year>2011</year>
<volume>39</volume>
<fpage>450</fpage>
</element-citation>
</ref>
<ref id="B22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheung</surname>
<given-names>W.A.</given-names>
</name>
<name>
<surname>Ouellette</surname>
<given-names>B.F.</given-names>
</name>
<name>
<surname>Wasserman</surname>
<given-names>W.W.</given-names>
</name>
</person-group>
<article-title>Inferring novel gene-disease associations using medical subject heading over-representation profiles</article-title>
<source>Genome Med.</source>
<year>2012</year>
<volume>4</volume>
<fpage>75</fpage>
<pub-id pub-id-type="pmid">23021552</pub-id>
</element-citation>
</ref>
<ref id="B23">
<label>23.</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Hristovski</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Rindflesch</surname>
<given-names>T.C.</given-names>
</name>
<name>
<surname>Peterlin</surname>
<given-names>B.</given-names>
</name>
</person-group>
<article-title>Exploiting semantic relations for literature-based discovery</article-title>
<source>AMIA Annu. Symp. Proc.</source>
<year>2006</year>
<fpage>349</fpage>
<lpage>353</lpage>
</element-citation>
</ref>
<ref id="B24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fontaine</surname>
<given-names>J.F.</given-names>
</name>
<name>
<surname>Priller</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Barbosa-Silva</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Andrade-Navarro</surname>
<given-names>M.A.</given-names>
</name>
</person-group>
<article-title>Genie: literature-based gene prioritization at multi genomic scale</article-title>
<source>Nucleic Acids Res.</source>
<year>2011</year>
<volume>39</volume>
<fpage>W455</fpage>
<lpage>W461</lpage>
<pub-id pub-id-type="pmid">21609954</pub-id>
</element-citation>
</ref>
<ref id="B25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aronson</surname>
<given-names>A.R.</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>F.M.</given-names>
</name>
</person-group>
<article-title>An overview of MetaMap: historical perspective and recent advances</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2010</year>
<volume>17</volume>
<fpage>229</fpage>
<lpage>236</lpage>
<pub-id pub-id-type="pmid">20442139</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bodenreider</surname>
<given-names>O.</given-names>
</name>
</person-group>
<article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>D267</fpage>
<lpage>D270</lpage>
<pub-id pub-id-type="pmid">14681409</pub-id>
</element-citation>
</ref>
<ref id="B27">
<label>27.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Manning</surname>
<given-names>C.D.</given-names>
</name>
<name>
<surname>Raghavan</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Schutze</surname>
<given-names>H.</given-names>
</name>
</person-group>
<source>Introduction to Information Retrieval</source>
<year>2008</year>
<publisher-name>Cambridge University Press</publisher-name>
<fpage>100</fpage>
<lpage>123</lpage>
</element-citation>
</ref>
<ref id="B28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aerts</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lambrechts</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Maity</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Van Loo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Coessens</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>De Smet</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Tranchevent</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Marynen</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Hassan</surname>
<given-names>B.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Gene prioritization through genomic data fusion</article-title>
<source>Nat. Biotechnol.</source>
<year>2006</year>
<volume>24</volume>
<fpage>537</fpage>
<lpage>544</lpage>
<pub-id pub-id-type="pmid">16680138</pub-id>
</element-citation>
</ref>
<ref id="B29">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lopez-Bigas</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Ouzounis</surname>
<given-names>C.A.</given-names>
</name>
</person-group>
<article-title>Genome-wide identification of genes likely to be involved in human genetic disease</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>3108</fpage>
<lpage>3114</lpage>
<pub-id pub-id-type="pmid">15181176</pub-id>
</element-citation>
</ref>
<ref id="B30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lukk</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kapushesky</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Nikkilä</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Parkinson</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Goncalves</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Huber</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ukkonen</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Brazma</surname>
<given-names>A.</given-names>
</name>
</person-group>
<article-title>A global map of human gene expression</article-title>
<source>Nat. Biotechnol.</source>
<year>2010</year>
<volume>28</volume>
<fpage>322</fpage>
<lpage>324</lpage>
<pub-id pub-id-type="pmid">20379172</pub-id>
</element-citation>
</ref>
<ref id="B31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sifrim</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Popovic</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tranchevent</surname>
<given-names>L.C.</given-names>
</name>
<name>
<surname>Ardeshirdavani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sakai</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Konings</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Vermeesch</surname>
<given-names>J.R.</given-names>
</name>
<name>
<surname>Aerts</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y.</given-names>
</name>
</person-group>
<article-title>eXtasy: variant prioritization by genomic data fusion</article-title>
<source>Nat. Methods.</source>
<year>2013</year>
<volume>10</volume>
<fpage>1083</fpage>
<lpage>1084</lpage>
<pub-id pub-id-type="pmid">24076761</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Belgique/explor/OpenAccessBelV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000098  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000098  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Belgique
   |area=    OpenAccessBelV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}


This area was generated with Dilib version V0.6.25.
Data generation: Thu Dec 1 00:43:49 2016. Site generation: Wed Mar 6 14:51:30 2024