SGML exploration server

Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids

Internal identifier: 000158 (Pmc/Corpus); previous: 000157; next: 000159

Authors: Rick Jordan; Shyam Visweswaran; Vanathi Gopalakrishnan

Source:

RBID : PMC:4215335

Abstract

Background

Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.

Methodology

A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.
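
The query construction and scoring described above can be illustrated with a minimal Python sketch. It is not the authors' code. The exact PubMed query strings and the z-score formula are not given in the abstract, so the sketch assumes simple Boolean queries and a pooled two-proportion z-test as one plausible choice. The function names `build_queries` and `two_proportion_z` and the mention counts in the usage lines are hypothetical; only the biofluid list and the abstract-set sizes come from the text.

```python
import math

# The 14 biofluids listed in the abstract.
BIOFLUIDS = [
    "bile", "blood", "breastmilk", "cerebrospinal fluid", "mucus", "plasma",
    "saliva", "semen", "serum", "synovial fluid", "stool", "sweat", "tears",
    "urine",
]


def build_queries(disease, biofluid):
    """Return (positive, negative) query strings for one disease/biofluid pair.

    Assumed rendering of the search terms; the real queries may differ.
    """
    positive = f'"{disease}" AND "{biofluid}"'
    negative = f'"{biofluid}" NOT "{disease}"'
    return positive, negative


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """Pooled two-proportion z-score (an assumed scoring formula).

    Compares the fraction of positive abstracts mentioning an entity with
    the fraction of negative abstracts mentioning it.
    """
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    if se == 0:
        # No variation at all (entity never or always mentioned).
        return 0.0
    return (p_pos - p_neg) / se


if __name__ == "__main__":
    print(build_queries("breast cancer", "serum"))
    # Set sizes are the breast cancer totals from the abstract;
    # the mention counts (120 and 900) are made up for illustration.
    print(two_proportion_z(120, 34296, 900, 2653396))
```

An entity mentioned far more often in the positive set than in the negative set receives a large positive score, which is the property that lets a ranked list separate disease-associated entities from those that are merely common in a given biofluid's literature.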

Results

Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes & Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.

Conclusions

We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.


Url:
DOI: 10.1186/2043-9113-4-13
PubMed: 25379168
PubMed Central: 4215335

Links to Exploration step

PMC:4215335

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids</title>
<author>
<name sortKey="Jordan, Rick" sort="Jordan, Rick" uniqKey="Jordan R" first="Rick" last="Jordan">Rick Jordan</name>
<affiliation>
<nlm:aff id="I1">Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Visweswaran, Shyam" sort="Visweswaran, Shyam" uniqKey="Visweswaran S" first="Shyam" last="Visweswaran">Shyam Visweswaran</name>
<affiliation>
<nlm:aff id="I1">Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Department of Computational &amp; Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gopalakrishnan, Vanathi" sort="Gopalakrishnan, Vanathi" uniqKey="Gopalakrishnan V" first="Vanathi" last="Gopalakrishnan">Vanathi Gopalakrishnan</name>
<affiliation>
<nlm:aff id="I1">Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Department of Computational &amp; Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25379168</idno>
<idno type="pmc">4215335</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215335</idno>
<idno type="RBID">PMC:4215335</idno>
<idno type="doi">10.1186/2043-9113-4-13</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000158</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000158</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids</title>
<author>
<name sortKey="Jordan, Rick" sort="Jordan, Rick" uniqKey="Jordan R" first="Rick" last="Jordan">Rick Jordan</name>
<affiliation>
<nlm:aff id="I1">Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Visweswaran, Shyam" sort="Visweswaran, Shyam" uniqKey="Visweswaran S" first="Shyam" last="Visweswaran">Shyam Visweswaran</name>
<affiliation>
<nlm:aff id="I1">Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Department of Computational &amp; Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Gopalakrishnan, Vanathi" sort="Gopalakrishnan, Vanathi" uniqKey="Gopalakrishnan V" first="Vanathi" last="Gopalakrishnan">Vanathi Gopalakrishnan</name>
<affiliation>
<nlm:aff id="I1">Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I3">Department of Computational &amp; Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Journal of Clinical Bioinformatics</title>
<idno type="eISSN">2043-9113</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.</p>
</sec>
<sec>
<title>Methodology</title>
<p>A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.</p>
</sec>
<sec>
<title>Results</title>
<p>Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes &amp; Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Hirschman, L" uniqKey="Hirschman L">L Hirschman</name>
</author>
<author>
<name sortKey="Park, Jc" uniqKey="Park J">JC Park</name>
</author>
<author>
<name sortKey="Tsujii, J" uniqKey="Tsujii J">J Tsujii</name>
</author>
<author>
<name sortKey="Wong, L" uniqKey="Wong L">L Wong</name>
</author>
<author>
<name sortKey="Wu, Ch" uniqKey="Wu C">CH Wu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Adamic, La" uniqKey="Adamic L">LA Adamic</name>
</author>
<author>
<name sortKey="Wilkinson, D" uniqKey="Wilkinson D">D Wilkinson</name>
</author>
<author>
<name sortKey="Huberman, Ba" uniqKey="Huberman B">BA Huberman</name>
</author>
<author>
<name sortKey="Adar, E" uniqKey="Adar E">E Adar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wren, Jd" uniqKey="Wren J">JD Wren</name>
</author>
<author>
<name sortKey="Bekeredjian, R" uniqKey="Bekeredjian R">R Bekeredjian</name>
</author>
<author>
<name sortKey="Stewart, Ja" uniqKey="Stewart J">JA Stewart</name>
</author>
<author>
<name sortKey="Shohet, Rv" uniqKey="Shohet R">RV Shohet</name>
</author>
<author>
<name sortKey="Garner, Hr" uniqKey="Garner H">HR Garner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xuan, W" uniqKey="Xuan W">W Xuan</name>
</author>
<author>
<name sortKey="Wang, P" uniqKey="Wang P">P Wang</name>
</author>
<author>
<name sortKey="Watson, Sj" uniqKey="Watson S">SJ Watson</name>
</author>
<author>
<name sortKey="Meng, F" uniqKey="Meng F">F Meng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hristovski, D" uniqKey="Hristovski D">D Hristovski</name>
</author>
<author>
<name sortKey="Peterlin, B" uniqKey="Peterlin B">B Peterlin</name>
</author>
<author>
<name sortKey="Mitchell, Ja" uniqKey="Mitchell J">JA Mitchell</name>
</author>
<author>
<name sortKey="Humphrey, Sm" uniqKey="Humphrey S">SM Humphrey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Novichkova, S" uniqKey="Novichkova S">S Novichkova</name>
</author>
<author>
<name sortKey="Egorov, S" uniqKey="Egorov S">S Egorov</name>
</author>
<author>
<name sortKey="Daraseila, N" uniqKey="Daraseila N">N Daraseila</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Srinivasan, P" uniqKey="Srinivasan P">P Srinivasan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leonard, Je" uniqKey="Leonard J">JE Leonard</name>
</author>
<author>
<name sortKey="Colombe, Jb" uniqKey="Colombe J">JB Colombe</name>
</author>
<author>
<name sortKey="Levy, Jl" uniqKey="Levy J">JL Levy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jensen, Lj" uniqKey="Jensen L">LJ Jensen</name>
</author>
<author>
<name sortKey="Saric, J" uniqKey="Saric J">J Saric</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krallinger, M" uniqKey="Krallinger M">M Krallinger</name>
</author>
<author>
<name sortKey="Valencia, A" uniqKey="Valencia A">A Valencia</name>
</author>
<author>
<name sortKey="Hirschman, L" uniqKey="Hirschman L">L Hirschman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cohen, Am" uniqKey="Cohen A">AM Cohen</name>
</author>
<author>
<name sortKey="Hersh, Wr" uniqKey="Hersh W">WR Hersh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Swanson, Dr" uniqKey="Swanson D">DR Swanson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhu, S" uniqKey="Zhu S">S Zhu</name>
</author>
<author>
<name sortKey="Okuno, Y" uniqKey="Okuno Y">Y Okuno</name>
</author>
<author>
<name sortKey="Tsujimoto, G" uniqKey="Tsujimoto G">G Tsujimoto</name>
</author>
<author>
<name sortKey="Mamitsuka, H" uniqKey="Mamitsuka H">H Mamitsuka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Frijters, R" uniqKey="Frijters R">R Frijters</name>
</author>
<author>
<name sortKey="Van Vugt, M" uniqKey="Van Vugt M">M Van Vugt</name>
</author>
<author>
<name sortKey="Smeets, R" uniqKey="Smeets R">R Smeets</name>
</author>
<author>
<name sortKey="Van Schaik, R" uniqKey="Van Schaik R">R Van Schaik</name>
</author>
<author>
<name sortKey="De Vlieg, J" uniqKey="De Vlieg J">J De Vlieg</name>
</author>
<author>
<name sortKey="Alkema, W" uniqKey="Alkema W">W Alkema</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Liu, C" uniqKey="Liu C">C Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Al Mubaid, H" uniqKey="Al Mubaid H">H Al-Mubaid</name>
</author>
<author>
<name sortKey="Singh, Rk" uniqKey="Singh R">RK Singh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Andrade, Ma" uniqKey="Andrade M">MA Andrade</name>
</author>
<author>
<name sortKey="Valencia, A" uniqKey="Valencia A">A Valencia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Younesi, E" uniqKey="Younesi E">E Younesi</name>
</author>
<author>
<name sortKey="Toldo, L" uniqKey="Toldo L">L Toldo</name>
</author>
<author>
<name sortKey="Muller, B" uniqKey="Muller B">B Muller</name>
</author>
<author>
<name sortKey="Friedrich, Cm" uniqKey="Friedrich C">CM Friedrich</name>
</author>
<author>
<name sortKey="Novac, N" uniqKey="Novac N">N Novac</name>
</author>
<author>
<name sortKey="Scheer, A" uniqKey="Scheer A">A Scheer</name>
</author>
<author>
<name sortKey="Hofmann Apitius, M" uniqKey="Hofmann Apitius M">M Hofmann-Apitius</name>
</author>
<author>
<name sortKey="Fluck, J" uniqKey="Fluck J">J Fluck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Deyati, A" uniqKey="Deyati A">A Deyati</name>
</author>
<author>
<name sortKey="Younesi, E" uniqKey="Younesi E">E Younesi</name>
</author>
<author>
<name sortKey="Hofmann Apitius, M" uniqKey="Hofmann Apitius M">M Hofmann-Apitius</name>
</author>
<author>
<name sortKey="Novac, N" uniqKey="Novac N">N Novac</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Veenstra, T" uniqKey="Veenstra T">T Veenstra</name>
</author>
<author>
<name sortKey="Conrads, T" uniqKey="Conrads T">T Conrads</name>
</author>
<author>
<name sortKey="Hood, B" uniqKey="Hood B">B Hood</name>
</author>
<author>
<name sortKey="Avellino, A" uniqKey="Avellino A">A Avellino</name>
</author>
<author>
<name sortKey="Ellenbogen, R" uniqKey="Ellenbogen R">R Ellenbogen</name>
</author>
<author>
<name sortKey="Morrison, R" uniqKey="Morrison R">R Morrison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, M" uniqKey="Zhou M">M Zhou</name>
</author>
<author>
<name sortKey="Conrads, T" uniqKey="Conrads T">T Conrads</name>
</author>
<author>
<name sortKey="Veenstra, T" uniqKey="Veenstra T">T Veenstra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, Y" uniqKey="Lee Y">Y Lee</name>
</author>
<author>
<name sortKey="Wong, D" uniqKey="Wong D">D Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gao, K" uniqKey="Gao K">K Gao</name>
</author>
<author>
<name sortKey="Zhou, H" uniqKey="Zhou H">H Zhou</name>
</author>
<author>
<name sortKey="Zhang, L" uniqKey="Zhang L">L Zhang</name>
</author>
<author>
<name sortKey="Lee, J" uniqKey="Lee J">J Lee</name>
</author>
<author>
<name sortKey="Zhou, Q" uniqKey="Zhou Q">Q Zhou</name>
</author>
<author>
<name sortKey="Hu, S" uniqKey="Hu S">S Hu</name>
</author>
<author>
<name sortKey="Wolinsky, L" uniqKey="Wolinsky L">L Wolinsky</name>
</author>
<author>
<name sortKey="Farrell, J" uniqKey="Farrell J">J Farrell</name>
</author>
<author>
<name sortKey="Eibl, G" uniqKey="Eibl G">G Eibl</name>
</author>
<author>
<name sortKey="Wong, D" uniqKey="Wong D">D Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Xu, X" uniqKey="Xu X">X Xu</name>
</author>
<author>
<name sortKey="Veenstra, T" uniqKey="Veenstra T">T Veenstra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Delaleu, N" uniqKey="Delaleu N">N Delaleu</name>
</author>
<author>
<name sortKey="Immervoll, H" uniqKey="Immervoll H">H Immervoll</name>
</author>
<author>
<name sortKey="Cornelius, J" uniqKey="Cornelius J">J Cornelius</name>
</author>
<author>
<name sortKey="Jonsson, R" uniqKey="Jonsson R">R Jonsson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alterovitz, G" uniqKey="Alterovitz G">G Alterovitz</name>
</author>
<author>
<name sortKey="Xiang, M" uniqKey="Xiang M">M Xiang</name>
</author>
<author>
<name sortKey="Liu, J" uniqKey="Liu J">J Liu</name>
</author>
<author>
<name sortKey="Chang, A" uniqKey="Chang A">A Chang</name>
</author>
<author>
<name sortKey="Ramoni, Mf" uniqKey="Ramoni M">MF Ramoni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Camon, E" uniqKey="Camon E">E Camon</name>
</author>
<author>
<name sortKey="Magrane, M" uniqKey="Magrane M">M Magrane</name>
</author>
<author>
<name sortKey="Barrell, D" uniqKey="Barrell D">D Barrell</name>
</author>
<author>
<name sortKey="Lee, V" uniqKey="Lee V">V Lee</name>
</author>
<author>
<name sortKey="Dimmer, E" uniqKey="Dimmer E">E Dimmer</name>
</author>
<author>
<name sortKey="Maslen, J" uniqKey="Maslen J">J Maslen</name>
</author>
<author>
<name sortKey="Binns, D" uniqKey="Binns D">D Binns</name>
</author>
<author>
<name sortKey="Harte, N" uniqKey="Harte N">N Harte</name>
</author>
<author>
<name sortKey="Lopez, R" uniqKey="Lopez R">R Lopez</name>
</author>
<author>
<name sortKey="Apweiler, R" uniqKey="Apweiler R">R Apweiler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ashburner, M" uniqKey="Ashburner M">M Ashburner</name>
</author>
<author>
<name sortKey="Ball, Ca" uniqKey="Ball C">CA Ball</name>
</author>
<author>
<name sortKey="Blake, Ja" uniqKey="Blake J">JA Blake</name>
</author>
<author>
<name sortKey="Botstein, D" uniqKey="Botstein D">D Botstein</name>
</author>
<author>
<name sortKey="Butler, H" uniqKey="Butler H">H Butler</name>
</author>
<author>
<name sortKey="Cherry, Jm" uniqKey="Cherry J">JM Cherry</name>
</author>
<author>
<name sortKey="Davis, Ap" uniqKey="Davis A">AP Davis</name>
</author>
<author>
<name sortKey="Dolinski, K" uniqKey="Dolinski K">K Dolinski</name>
</author>
<author>
<name sortKey="Dwight, Ss" uniqKey="Dwight S">SS Dwight</name>
</author>
<author>
<name sortKey="Eppig, Jt" uniqKey="Eppig J">JT Eppig</name>
</author>
<author>
<name sortKey="Harris, Ma" uniqKey="Harris M">MA Harris</name>
</author>
<author>
<name sortKey="Hill, Dp" uniqKey="Hill D">DP Hill</name>
</author>
<author>
<name sortKey="Issel Tarver, L" uniqKey="Issel Tarver L">L Issel-Tarver</name>
</author>
<author>
<name sortKey="Kasarskis, A" uniqKey="Kasarskis A">A Kasarskis</name>
</author>
<author>
<name sortKey="Lewis, S" uniqKey="Lewis S">S Lewis</name>
</author>
<author>
<name sortKey="Matese, Jc" uniqKey="Matese J">JC Matese</name>
</author>
<author>
<name sortKey="Richardson, Je" uniqKey="Richardson J">JE Richardson</name>
</author>
<author>
<name sortKey="Ringwald, M" uniqKey="Ringwald M">M Ringwald</name>
</author>
<author>
<name sortKey="Rubin, Gm" uniqKey="Rubin G">GM Rubin</name>
</author>
<author>
<name sortKey="Sherlock, G" uniqKey="Sherlock G">G Sherlock</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wheeler, Dl" uniqKey="Wheeler D">DL Wheeler</name>
</author>
<author>
<name sortKey="Barrett, T" uniqKey="Barrett T">T Barrett</name>
</author>
<author>
<name sortKey="Benson, Da" uniqKey="Benson D">DA Benson</name>
</author>
<author>
<name sortKey="Bryant, Sh" uniqKey="Bryant S">SH Bryant</name>
</author>
<author>
<name sortKey="Canese, K" uniqKey="Canese K">K Canese</name>
</author>
<author>
<name sortKey="Chetvernin, V" uniqKey="Chetvernin V">V Chetvernin</name>
</author>
<author>
<name sortKey="Church, Dm" uniqKey="Church D">DM Church</name>
</author>
<author>
<name sortKey="Dicuccio, M" uniqKey="Dicuccio M">M DiCuccio</name>
</author>
<author>
<name sortKey="Edgar, R" uniqKey="Edgar R">R Edgar</name>
</author>
<author>
<name sortKey="Federhen, S" uniqKey="Federhen S">S Federhen</name>
</author>
<author>
<name sortKey="Geer, Ly" uniqKey="Geer L">LY Geer</name>
</author>
<author>
<name sortKey="Kapustin, Y" uniqKey="Kapustin Y">Y Kapustin</name>
</author>
<author>
<name sortKey="Khovayko, O" uniqKey="Khovayko O">O Khovayko</name>
</author>
<author>
<name sortKey="Landsman, D" uniqKey="Landsman D">D Landsman</name>
</author>
<author>
<name sortKey="Lipman, Dj" uniqKey="Lipman D">DJ Lipman</name>
</author>
<author>
<name sortKey="Madden, Tl" uniqKey="Madden T">TL Madden</name>
</author>
<author>
<name sortKey="Maglott, Dr" uniqKey="Maglott D">DR Maglott</name>
</author>
<author>
<name sortKey="Ostell, J" uniqKey="Ostell J">J Ostell</name>
</author>
<author>
<name sortKey="Miller, V" uniqKey="Miller V">V Miller</name>
</author>
<author>
<name sortKey="Pruitt, Kd" uniqKey="Pruitt K">KD Pruitt</name>
</author>
<author>
<name sortKey="Schuler, Gd" uniqKey="Schuler G">GD Schuler</name>
</author>
<author>
<name sortKey="Sequeira, E" uniqKey="Sequeira E">E Sequeira</name>
</author>
<author>
<name sortKey="Sherry, St" uniqKey="Sherry S">ST Sherry</name>
</author>
<author>
<name sortKey="Sirotkin, K" uniqKey="Sirotkin K">K Sirotkin</name>
</author>
<author>
<name sortKey="Souvorov, A" uniqKey="Souvorov A">A Souvorov</name>
</author>
<author>
<name sortKey="Starchecko, G" uniqKey="Starchecko G">G Starchecko</name>
</author>
<author>
<name sortKey="Tatusov, Rl" uniqKey="Tatusov R">RL Tatusov</name>
</author>
<author>
<name sortKey="Tatusova, Ta" uniqKey="Tatusova T">TA Tatusova</name>
</author>
<author>
<name sortKey="Wagner, L" uniqKey="Wagner L">L Wagner</name>
</author>
<author>
<name sortKey="Yaschenko, E" uniqKey="Yaschenko E">E Yaschenko</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hewett, M" uniqKey="Hewett M">M Hewett</name>
</author>
<author>
<name sortKey="Oliver, De" uniqKey="Oliver D">DE Oliver</name>
</author>
<author>
<name sortKey="Rubin, Dl" uniqKey="Rubin D">DL Rubin</name>
</author>
<author>
<name sortKey="Easton, Kl" uniqKey="Easton K">KL Easton</name>
</author>
<author>
<name sortKey="Stuart, Jm" uniqKey="Stuart J">JM Stuart</name>
</author>
<author>
<name sortKey="Altman, Rb" uniqKey="Altman R">RB Altman</name>
</author>
<author>
<name sortKey="Klein, Te" uniqKey="Klein T">TE Klein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Settles, B" uniqKey="Settles B">B Settles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Park, Yk" uniqKey="Park Y">YK Park</name>
</author>
<author>
<name sortKey="Kang, Tw" uniqKey="Kang T">TW Kang</name>
</author>
<author>
<name sortKey="Baek, Sj" uniqKey="Baek S">SJ Baek</name>
</author>
<author>
<name sortKey="Kim, Ki" uniqKey="Kim K">KI Kim</name>
</author>
<author>
<name sortKey="Kim, Sy" uniqKey="Kim S">SY Kim</name>
</author>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D Lee</name>
</author>
<author>
<name sortKey="Kim, Ys" uniqKey="Kim Y">YS Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wagner, Pd" uniqKey="Wagner P">PD Wagner</name>
</author>
<author>
<name sortKey="Srivastava, S" uniqKey="Srivastava S">S Srivastava</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bigbee, Wl" uniqKey="Bigbee W">WL Bigbee</name>
</author>
<author>
<name sortKey="Gopalakrishnan, V" uniqKey="Gopalakrishnan V">V Gopalakrishnan</name>
</author>
<author>
<name sortKey="Weissfeld, Jl" uniqKey="Weissfeld J">JL Weissfeld</name>
</author>
<author>
<name sortKey="Wilson, Do" uniqKey="Wilson D">DO Wilson</name>
</author>
<author>
<name sortKey="Dacic, S" uniqKey="Dacic S">S Dacic</name>
</author>
<author>
<name sortKey="Lokshin, Ae" uniqKey="Lokshin A">AE Lokshin</name>
</author>
<author>
<name sortKey="Siegfried, Jm" uniqKey="Siegfried J">JM Siegfried</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">J Clin Bioinforma</journal-id>
<journal-id journal-id-type="iso-abbrev">J Clin Bioinforma</journal-id>
<journal-title-group>
<journal-title>Journal of Clinical Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">2043-9113</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25379168</article-id>
<article-id pub-id-type="pmc">4215335</article-id>
<article-id pub-id-type="publisher-id">2043-9113-4-13</article-id>
<article-id pub-id-type="doi">10.1186/2043-9113-4-13</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Jordan</surname>
<given-names>Rick</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>rmj12@pitt.edu</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Visweswaran</surname>
<given-names>Shyam</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>shv3@pitt.edu</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Gopalakrishnan</surname>
<given-names>Vanathi</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>vanathi@pitt.edu</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA</aff>
<aff id="I2">
<label>2</label>
Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA</aff>
<aff id="I3">
<label>3</label>
Department of Computational &amp; Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA</aff>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>23</day>
<month>10</month>
<year>2014</year>
</pub-date>
<volume>4</volume>
<fpage>13</fpage>
<lpage>13</lpage>
<history>
<date date-type="received">
<day>26</day>
<month>6</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>2</day>
<month>10</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2014 Jordan et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Jordan et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.jclinbioinformatics.com/content/4/1/13"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Computational methods for mining of biomedical literature can be useful in augmenting manual searches of the literature using keywords for disease-specific biomarker discovery from biofluids. In this work, we develop and apply a semi-automated literature mining method to mine abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.</p>
</sec>
<sec>
<title>Methodology</title>
<p>A positive set of abstracts was defined by the terms ‘breast cancer’ and ‘lung cancer’ in conjunction with 14 separate ‘biofluids’ (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer.’ More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine significance of putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method’s performance.</p>
</sec>
<sec>
<title>Results</title>
<p>Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI’s On-line Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI’s Genes &amp; Disease, NCI’s Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method assessed for breast and lung cancer.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Literature mining</kwd>
<kwd>Text mining</kwd>
<kwd>Lung cancer</kwd>
<kwd>Breast cancer</kwd>
<kwd>Biomarker</kwd>
<kwd>Biofluid</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>The amount of scientific information has become overwhelmingly abundant, making effective querying difficult for scientists and physicians. While many data mining and literature mining methods have been described [
<xref ref-type="bibr" rid="B1">1</xref>
-
<xref ref-type="bibr" rid="B11">11</xref>
], new and innovative methods are highly desired. Articles have been written about drawing implicit connections from separate literatures [
<xref ref-type="bibr" rid="B12">12</xref>
-
<xref ref-type="bibr" rid="B15">15</xref>
], and many unidentified connections exist within publicly available material. Identifying putative disease biomarkers may lead to new connections between biofluids and diseases being discovered.</p>
<p>False positives in text mining findings can be reduced by using negative abstract sets, that is, abstracts that are specifically not about the entity or relationship of interest. It is also important to examine all abstracts, both positive and negative, so that the results are comprehensive and statistical significance measures can be calculated accurately. However, negative abstract sets are rarely discussed in detail in the literature.</p>
<p>A literature search identified several biomedical text mining papers describing the use of a negative set of abstracts [
<xref ref-type="bibr" rid="B2">2</xref>
,
<xref ref-type="bibr" rid="B16">16</xref>
-
<xref ref-type="bibr" rid="B19">19</xref>
]. Implementations of negative sets of abstracts seem to be described far less than would be expected. Adamic
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="B2">2</xref>
] presented a statistical approach for finding gene-disease relations. The authors described a frequency of occurrence count and an expected number of relevant abstracts vs. a random set. Gene pairs and gene symbol disambiguation results were compared to a human-edited breast cancer gene database.</p>
<p>The method of Al-Mubaid
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="B16">16</xref>
] for discovering protein-to-disease associations from MEDLINE abstracts employed a protein and disease name dictionary and “positive” and “negative” sets of abstracts. The positive set consisted of abstracts relevant to a given disease, as determined by a PubMed keyword search; the negative set contained a random set of abstracts that did not mention the disease. The method identified proteins that were relevant to the disease by comparing the frequency distributions of protein names in the positive set and the overall set, which was the union of the positive and negative sets, and selected those proteins for which the distributions were significantly different statistically.</p>
<p>Andrade [
<xref ref-type="bibr" rid="B17">17</xref>
] was interested in annotating the biological function of protein sequences. In this article, the ‘treatment of text with statistical methods’ was discussed. Their approach estimated the significance of each word in a given set of protein family abstracts by comparing the word’s abundance and distribution with those in a background set of abstracts from varying protein families.</p>
<p>Younesi
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B19">19</xref>
] divided the biomarker terminology into six concept classes (clinical management; diagnostics; prognosis; statistics; evidence; and antecedent). By including this extra level of restriction, the authors were able to significantly reduce the number of retrieved relevant documents. Frequency and entropy ranking methods were applied to the resulting gene lists, with frequency ranking performing better overall.</p>
<p>Individual biofluids have been characterized [
<xref ref-type="bibr" rid="B20">20</xref>
-
<xref ref-type="bibr" rid="B25">25</xref>
]; however, we have found only one comprehensive comparison of more than a few biofluids. Alterovitz
<italic>et al.</italic>
[
<xref ref-type="bibr" rid="B26">26</xref>
] compared 10 biofluid proteomes to 16 tissue proteomes to determine tissue function and tissue-specific candidate biomarkers that could be found in a given biofluid. Gene Ontology (GO) [
<xref ref-type="bibr" rid="B27">27</xref>
,
<xref ref-type="bibr" rid="B28">28</xref>
] (
<ext-link ext-link-type="uri" xlink:href="http://www.geneontology.org/">http://www.geneontology.org/</ext-link>
) was used for functionality mapping, NCBI’s Online Mendelian Inheritance in Man (OMIM) [
<xref ref-type="bibr" rid="B29">29</xref>
] (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/omim/">http://www.ncbi.nlm.nih.gov/omim/</ext-link>
) for disease mapping, and the Pharmacogenomics Knowledge Base (PharmGKB) [
<xref ref-type="bibr" rid="B30">30</xref>
] (
<ext-link ext-link-type="uri" xlink:href="https://www.pharmgkb.org/">https://www.pharmgkb.org/</ext-link>
) for drug mapping; a relative entropy measure was the scoring method of choice. PubMed co-citation frequencies were used to determine the overall quality of the candidate biomarkers.</p>
<p>Comparisons such as those described above have the potential to reveal critical knowledge as to which biomarkers for a disease may be detected in a given biofluid. As some biofluids are more easily obtainable than others, elimination of invasive sampling procedures is highly desirable. However, details describing which potential biomarkers can be obtained in given biofluids are not clearly defined.</p>
<p>In this paper, we developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancers, with a putative biomarker being described as a ‘gene’ or ‘protein’. More than 5.3 million PubMed abstracts were analysed for biomarker-disease associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). The abstract sets were further stratified among 14 biofluids. New knowledge is provided in the form of known disease biomarker lists, ranked newly discovered biomarker-disease-biofluid relationships, and biomarker specificity across biofluids. On average (see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
), we expect true positive rates for new discoveries to be 87.5% for breast cancer and 71.59% for lung cancer. These biomarker-disease associations and accompanying z-scores will be used as informative prior values in future disease modeling activities.</p>
</sec>
<sec sec-type="methods">
<title>Methodology</title>
<sec>
<title>Automation</title>
<p>Python scripts were developed to reduce the amount of manual effort needed to achieve final scores for each potential biofluid biomarker, and to eliminate manual errors. Figure 
<xref ref-type="fig" rid="F1">1</xref>
shows a flowchart that summarizes the experimental methodology used.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Semi-automated flowchart of the information retrieval process.</bold>
Python scripts were written to process text files. ABNER was used for tagging biological entities, and the z-score calculation was performed using Microsoft Excel.</p>
</caption>
<graphic xlink:href="2043-9113-4-13-1"></graphic>
</fig>
</sec>
<sec>
<title>Information retrieval</title>
<p>To retrieve abstracts related to breast and lung cancer, a PubMed query was performed using the following limits: Abstracts, English, and Human. Query results for each disease-biofluid combination can be found in Table 
<xref ref-type="table" rid="T1">1</xref>
(see Additional file
<xref ref-type="supplementary-material" rid="S2">2</xref>
for Biofluid synonyms used). An abstract consists of journal entry information, title, authors, affiliations, text, copyright information, and PubMed ID. The following sets of abstracts were obtained using the selected criteria from the positive and/or negative queries (defined below).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Size of the abstract sets returned from queries of breast and lung cancer</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th colspan="3" align="center" valign="bottom">
<bold>Breast cancer</bold>
<hr></hr>
</th>
<th colspan="3" align="center" valign="bottom">
<bold>Lung cancer</bold>
<hr></hr>
</th>
</tr>
<tr>
<th align="left">
<bold>Biofluid</bold>
</th>
<th align="center">
<bold>Positives</bold>
</th>
<th align="center">
<bold>Negatives</bold>
</th>
<th align="left">
<bold>Biofluid</bold>
</th>
<th align="center">
<bold>Positives</bold>
</th>
<th align="center">
<bold>Negatives</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Bile
<hr></hr>
</td>
<td align="center" valign="bottom">360
<hr></hr>
</td>
<td align="center" valign="bottom">40,250
<hr></hr>
</td>
<td align="left" valign="bottom">Bile
<hr></hr>
</td>
<td align="center" valign="bottom">328
<hr></hr>
</td>
<td align="center" valign="bottom">40,290
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Blood
<hr></hr>
</td>
<td align="center" valign="bottom">18,939
<hr></hr>
</td>
<td align="center" valign="bottom">1,540,721
<hr></hr>
</td>
<td align="left" valign="bottom">Blood
<hr></hr>
</td>
<td align="center" valign="bottom">15,710
<hr></hr>
</td>
<td align="center" valign="bottom">1,522,046
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Breastmilk
<hr></hr>
</td>
<td align="center" valign="bottom">1,047
<hr></hr>
</td>
<td align="center" valign="bottom">17,874
<hr></hr>
</td>
<td align="left" valign="bottom">Breastmilk
<hr></hr>
</td>
<td align="center" valign="bottom">99
<hr></hr>
</td>
<td align="center" valign="bottom">18,834
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">CSF
<hr></hr>
</td>
<td align="center" valign="bottom">252
<hr></hr>
</td>
<td align="center" valign="bottom">42,711
<hr></hr>
</td>
<td align="left" valign="bottom">CSF
<hr></hr>
</td>
<td align="center" valign="bottom">298
<hr></hr>
</td>
<td align="center" valign="bottom">42,676
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Mucus
<hr></hr>
</td>
<td align="center" valign="bottom">116
<hr></hr>
</td>
<td align="center" valign="bottom">25,122
<hr></hr>
</td>
<td align="left" valign="bottom">Mucus
<hr></hr>
</td>
<td align="center" valign="bottom">1,445
<hr></hr>
</td>
<td align="center" valign="bottom">23,801
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Plasma
<hr></hr>
</td>
<td align="center" valign="bottom">4,327
<hr></hr>
</td>
<td align="center" valign="bottom">342,415
<hr></hr>
</td>
<td align="left" valign="bottom">Plasma
<hr></hr>
</td>
<td align="center" valign="bottom">3,227
<hr></hr>
</td>
<td align="center" valign="bottom">343,678
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Saliva
<hr></hr>
</td>
<td align="center" valign="bottom">149
<hr></hr>
</td>
<td align="center" valign="bottom">22,694
<hr></hr>
</td>
<td align="left" valign="bottom">Saliva
<hr></hr>
</td>
<td align="center" valign="bottom">86
<hr></hr>
</td>
<td align="center" valign="bottom">22,770
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Semen
<hr></hr>
</td>
<td align="center" valign="bottom">40
<hr></hr>
</td>
<td align="center" valign="bottom">12,956
<hr></hr>
</td>
<td align="left" valign="bottom">Semen
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">12,989
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Serum
<hr></hr>
</td>
<td align="center" valign="bottom">7,410
<hr></hr>
</td>
<td align="center" valign="bottom">415,218
<hr></hr>
</td>
<td align="left" valign="bottom">Serum
<hr></hr>
</td>
<td align="center" valign="bottom">6,029
<hr></hr>
</td>
<td align="center" valign="bottom">412,897
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">SF
<hr></hr>
</td>
<td align="center" valign="bottom">18
<hr></hr>
</td>
<td align="center" valign="bottom">7,699
<hr></hr>
</td>
<td align="left" valign="bottom">SF
<hr></hr>
</td>
<td align="center" valign="bottom">18
<hr></hr>
</td>
<td align="center" valign="bottom">7,671
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Stool
<hr></hr>
</td>
<td align="center" valign="bottom">123
<hr></hr>
</td>
<td align="center" valign="bottom">37,574
<hr></hr>
</td>
<td align="left" valign="bottom">Stool
<hr></hr>
</td>
<td align="center" valign="bottom">90
<hr></hr>
</td>
<td align="center" valign="bottom">37,619
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Sweat
<hr></hr>
</td>
<td align="center" valign="bottom">321
<hr></hr>
</td>
<td align="center" valign="bottom">11,079
<hr></hr>
</td>
<td align="left" valign="bottom">Sweat
<hr></hr>
</td>
<td align="center" valign="bottom">88
<hr></hr>
</td>
<td align="center" valign="bottom">11,673
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Tears
<hr></hr>
</td>
<td align="center" valign="bottom">40
<hr></hr>
</td>
<td align="center" valign="bottom">11,651
<hr></hr>
</td>
<td align="left" valign="bottom">Tears
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">11,673
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Urine
<hr></hr>
</td>
<td align="center" valign="bottom">1,154
<hr></hr>
</td>
<td align="center" valign="bottom">125,462
<hr></hr>
</td>
<td align="left" valign="bottom">Urine
<hr></hr>
</td>
<td align="center" valign="bottom">918
<hr></hr>
</td>
<td align="center" valign="bottom">86,776
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Total</td>
<td align="center">34,296</td>
<td align="center">2,653,396</td>
<td align="left">Total</td>
<td align="center">28,355</td>
<td align="center">2,595,034</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>CSF = cerebrospinal fluid; SF = synovial fluid.</p>
</table-wrap-foot>
</table-wrap>
<p>• 
<bold>
<italic>Positive Abstract Sets</italic>
</bold>
</p>
<p>• A positive abstract set is defined as the set of abstracts obtained by using the following combination of keywords: ‘breast cancer AND (biofluid)’, e.g. breast cancer AND plasma, or ‘lung cancer AND (biofluid)’. From this point forward, all positive abstract sets will be called “positive sets” for brevity. Positive set queries were performed on 4-29-2013 for breast cancer and 5-2-2013 for lung cancer. The underlying assumption is that any possible biomarker mentioned in these abstract sets is related to both the disease and the biofluid. Queries were returned from PubMed as large text files, and Python scripts were implemented to process the files.</p>
<p>• 
<bold>
<italic>Negative Abstract Sets</italic>
</bold>
</p>
<p>• We define a negative abstract set as a set of abstracts returned using the keywords ‘(biofluid) NOT breast cancer’ or ‘(biofluid) NOT lung cancer’. From this point forward, all negative abstract sets will be called “negative sets”. Negative set queries were performed on 4-29-2013 for breast cancer and 5-2-2013 for lung cancer. Queries were returned from PubMed as large text files, and Python scripts were implemented to process the files.</p>
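<p>The positive and negative keyword combinations described above can be sketched as follows. This is only an illustration of how the query strings are formed: the function name is hypothetical, the biofluid synonyms of Additional file 2 are omitted, and the PubMed limits (Abstracts, English, Human) are assumed to be applied separately.</p>

```python
# Biofluid names as listed in the Methodology; synonym handling is omitted.
BIOFLUIDS = [
    "bile", "blood", "breastmilk", "cerebrospinal fluid", "mucus", "plasma",
    "saliva", "semen", "serum", "synovial fluid", "stool", "sweat", "tears",
    "urine",
]


def build_queries(disease, biofluids=BIOFLUIDS):
    """Return {biofluid: (positive_query, negative_query)} for one disease.

    Hypothetical helper: it only assembles the keyword strings described in
    the text and does not contact PubMed.
    """
    queries = {}
    for fluid in biofluids:
        positive = f"{disease} AND {fluid}"   # positive set keywords
        negative = f"{fluid} NOT {disease}"   # negative set keywords
        queries[fluid] = (positive, negative)
    return queries


if __name__ == "__main__":
    for fluid, (pos, neg) in build_queries("breast cancer").items():
        print(f"{fluid}: [{pos}] / [{neg}]")
```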
</sec>
<sec>
<title>Filtering information</title>
<p>Python scripts were developed to remove unwanted punctuation and other unwanted information from the abstracts.</p>
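<p>A filtering step of this kind might look like the sketch below. The exact rules of the original scripts are not described, so the choice of characters to keep (hyphens and slashes, which occur inside names such as HER-2/neu) is an assumption, as is the function name.</p>

```python
import re

# Assumption: hyphens and slashes are kept because they occur inside
# gene/protein names; every other punctuation mark is replaced by a space.
_UNWANTED = re.compile(r"[^\w\s\-/]")
_SPACES = re.compile(r"\s+")


def clean_abstract(text):
    """Drop punctuation other than hyphens and slashes, collapse whitespace."""
    text = _UNWANTED.sub(" ", text)
    return _SPACES.sub(" ", text).strip()
```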
</sec>
<sec>
<title>Named entity recognition</title>
<p>ABNER [
<xref ref-type="bibr" rid="B31">31</xref>
] (A Biomedical Named Entity Recognizer;
<ext-link ext-link-type="uri" xlink:href="http://pages.cs.wisc.edu/~bsettles/abner/">http://pages.cs.wisc.edu/~bsettles/abner/</ext-link>
) v1.5 was used to tag mentions of proteins, DNA, RNA, cell lines, and cell types in the positive and negative sets. Version 1.5 is trained on the NLPBA and BioCreative corpora. Reported performance measures for ABNER are in the range of 65.9-77.8 for protein recall and 68.1-74.5 for protein precision. Our method utilizes entities tagged as “Protein”, “DNA”, and “RNA”. A batch tagging process is available and proved to be extremely useful.</p>
</sec>
<sec>
<title>Entity extraction</title>
<p>Python scripts were developed to produce a list of tagged entities from the ABNER results file (.sgml), remove unwanted characters, tags, tagged entries, and duplicate putative biomarkers from the list, and to tally the final count of each biological entity found. PubMed identifiers were retained for tracking and manual verification purposes.</p>
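<p>The de-duplication and tallying described above can be sketched as follows. The input format, a sequence of (PubMed ID, entity) pairs obtained after parsing the tagger output, is a hypothetical intermediate representation, and case-insensitive matching is an assumption.</p>

```python
from collections import defaultdict


def tally_entities(tagged_mentions):
    """Map each entity to the sorted list of PubMed IDs that mention it.

    tagged_mentions: iterable of (pmid, entity_text) pairs, a hypothetical
    intermediate format produced after parsing the tagger output.
    """
    pmids_by_entity = defaultdict(set)
    for pmid, entity in tagged_mentions:
        name = entity.strip().upper()   # assumption: case-insensitive matching
        if name:
            pmids_by_entity[name].add(pmid)
    return {name: sorted(pmids) for name, pmids in pmids_by_entity.items()}


def abstract_counts(pmids_by_entity):
    """Number of distinct abstracts per entity (repeat mentions count once)."""
    return {name: len(pmids) for name, pmids in pmids_by_entity.items()}
```

<p>Keeping the PubMed identifiers alongside each entity is what allows the later manual verification step to trace a count back to its abstracts.</p>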
</sec>
<sec>
<title>Dictionary</title>
<p>A file named Protein Nomenclature was downloaded from the Human Protein Reference Database Copyright
<sup>©</sup>
2002-09, Johns Hopkins University and The Institute of Bioinformatics (Additional file
<xref ref-type="supplementary-material" rid="S3">3</xref>
), to use as a dictionary file. The file contains 19,327 unique IDs. Each entry consists of the HPRD id, gene symbol, RefSeq id, and aliases (separated by semi-colons). The gene symbol was used as the consensus name for all aliases found. The entities were mapped to gene symbols by another Python script.</p>
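<p>A minimal sketch of the alias-to-symbol mapping is given below. It assumes that the four fields are tab-separated, which is not stated in the text, and the function names are hypothetical; ambiguous aliases are resolved here by keeping the first gene symbol encountered, which is also an assumption.</p>

```python
def load_alias_map(lines):
    """Build {ALIAS: gene_symbol} from dictionary lines.

    Assumed layout (tab-separated): HPRD id, gene symbol, RefSeq id, aliases,
    with aliases separated by semicolons. The delimiter is an assumption.
    """
    alias_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue                      # skip malformed rows
        _hprd_id, symbol, _refseq, aliases = fields
        alias_map.setdefault(symbol.upper(), symbol)
        for alias in aliases.split(";"):
            alias = alias.strip()
            if alias:
                # First symbol seen wins when an alias is ambiguous.
                alias_map.setdefault(alias.upper(), symbol)
    return alias_map


def to_symbol(entity, alias_map):
    """Return the consensus gene symbol, or None if the entity is unknown."""
    return alias_map.get(entity.strip().upper())
```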
</sec>
<sec>
<title>Scoring</title>
<p>Counts were performed at the abstract level, where a mention of a given biomarker was assigned a count of 1, regardless of the frequency of mentions within the abstract.</p>
<p>Each z-score corresponds to a point in a normal distribution and expresses that point’s deviation from the mean in units of standard deviation. Z-scores were computed as follows:</p>
<p>Briefly, from Al-Mubaid [
<xref ref-type="bibr" rid="B16">16</xref>
], 
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
is the positive set of abstracts (i.e. disease/biofluid),
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
 = {
<italic>A</italic>
<sub>
<italic>1</italic>
</sub>
,
<italic>A</italic>
<sub>
<italic>2</italic>
</sub>
, …,
<italic>A</italic>
<sub>
<italic>n</italic>
</sub>
}.
<italic>A</italic>
is a given abstract,
<italic>S</italic>
<sub>
<italic>p</italic>
</sub>
is the set of proteins (markers) mentioned in the dictionary found in the positive set
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
,
<italic>S</italic>
<sub>
<italic>p</italic>
</sub>
 = {
<italic>P</italic>
<sub>
<italic>1</italic>
</sub>
,
<italic>P</italic>
<sub>
<italic>2</italic>
</sub>
, …,
<italic>P</italic>
<sub>
<italic>m</italic>
</sub>
}.
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
is the negative set of abstracts.</p>
<p>For each protein (marker)
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
in
<italic>S</italic>
<sub>
<italic>p</italic>
</sub>
, compute the document frequency (df) of
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
in both sets
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
and
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
as:</p>
<p>
<disp-formula>
<mml:math id="M1" name="2043-9113-4-13-i1" overflow="scroll">
<mml:mtable>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mn>1</mml:mn>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mtext mathvariant="normal">number</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">of</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">documents</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">in</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">which</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mspace width="5em"></mml:mspace>
<mml:mtext mathvariant="normal">is</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">mentioned</mml:mtext>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M2" name="2043-9113-4-13-i2" overflow="scroll">
<mml:mtable>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mn>2</mml:mn>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mtext mathvariant="normal">number</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">of</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">documents</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">in</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">which</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mspace width="5em"></mml:mspace>
<mml:mtext mathvariant="normal">is</mml:mtext>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="normal">mentioned</mml:mtext>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M3" name="2043-9113-4-13-i3" overflow="scroll">
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mi mathvariant="normal">t</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mn>1</mml:mn>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mn>2</mml:mn>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mtext>.</mml:mtext>
</mml:math>
</disp-formula>
</p>
<p>For each protein in the set
<italic>S</italic>
<sub>
<italic>p</italic>
</sub>
, compute an expectation (ex) value and an evidence (ev) value as:</p>
<p>
<disp-formula>
<mml:math id="M4" name="2043-9113-4-13-i4" overflow="scroll">
<mml:mtext mathvariant="normal">ex</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mi mathvariant="normal">t</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo stretchy="true">/</mml:mo>
<mml:mfenced open="|" close="|">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>×</mml:mo>
<mml:mfenced open="|" close="|">
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn mathvariant="italic">1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mspace width="0.5em"></mml:mspace>
<mml:mtext mathvariant="italic">and</mml:mtext>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M5" name="2043-9113-4-13-i5" overflow="scroll">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mi mathvariant="normal">v</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mn>1</mml:mn>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
</mml:math>
</disp-formula>
</p>
<p>The ex value measures the expected number of mentions of
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
in the abstracts in set
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
; ev measures the actual number of
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
abstracts that
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
has appeared in. The larger the difference in observed and expected document frequencies, ev(
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
) – ex(
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
), the more likely that
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
and the disease are significantly associated.</p>
<p>The difference is normalized by the total document frequency:</p>
<p>
<disp-formula>
<mml:math id="M6" name="2043-9113-4-13-i6" overflow="scroll">
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mi mathvariant="normal">v</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>-</mml:mo>
<mml:mtext mathvariant="normal">ex</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo stretchy="true">/</mml:mo>
<mml:mi mathvariant="normal">d</mml:mi>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mi mathvariant="normal">t</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mtext>.</mml:mtext>
</mml:math>
</disp-formula>
</p>
<p>The z-score is then calculated as:</p>
<p>
<disp-formula>
<mml:math id="M7" name="2043-9113-4-13-i7" overflow="scroll">
<mml:mi mathvariant="normal">Z</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>-</mml:mo>
<mml:mtext mathvariant="normal">mean</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mi mathvariant="normal">f</mml:mi>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo stretchy="true">/</mml:mo>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mi mathvariant="normal">D</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mi mathvariant="normal">f</mml:mi>
</mml:mfenced>
</mml:math>
</disp-formula>
</p>
<p>where mean(f) is the mean of all f values of all proteins of
<italic>S</italic>
<sub>
<italic>p</italic>
</sub>
and SD(f) is the standard deviation of the f values.</p>
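<p>The scoring steps above can be sketched in Python as follows. In this work the z-score calculation itself was performed in Microsoft Excel, so the code is an illustration rather than the implementation used; the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form of SD was applied. The 1.0 default in the second function is the significance cut-off used in this work.</p>

```python
from statistics import mean, pstdev


def z_scores(df1, df2, n_pos, n_neg):
    """Z-scores for markers found in the positive set.

    df1, df2: {marker: document frequency} in the positive (S1) and negative
    (S2) sets; n_pos, n_neg: the sizes |S1| and |S2|.
    """
    f = {}
    for marker, ev in df1.items():            # ev(Pi) = df1(Pi)
        dft = ev + df2.get(marker, 0)         # dft(Pi) = df1(Pi) + df2(Pi)
        if dft == 0:
            continue                          # marker never observed
        ex = dft / (n_pos + n_neg) * n_pos    # expected count in S1
        f[marker] = (ev - ex) / dft           # normalized difference f(Pi)
    if not f:
        return {}
    mu = mean(f.values())
    sd = pstdev(f.values())                   # assumption: population SD
    if sd == 0:
        return {marker: 0.0 for marker in f}
    return {marker: (value - mu) / sd for marker, value in f.items()}


def significant(scores, threshold=1.0):
    """Markers whose z-score exceeds the cut-off, highest first."""
    hits = [(m, z) for m, z in scores.items() if z > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

<p>Because ex scales with the share of positive abstracts in the combined set, a marker that appears mostly in negative abstracts obtains a low or negative f value, which is how the negative set suppresses generic, non-disease-specific entities.</p>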
<p>A threshold value of 1.0 was established as a significance cut-off (see Figure 
<xref ref-type="fig" rid="F2">2</xref>
). These z-score values will be used as informative prior values in future modeling efforts (Additional file
<xref ref-type="supplementary-material" rid="S4">4</xref>
and Additional file
<xref ref-type="supplementary-material" rid="S5">5</xref>
).</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Number of markers identified across the range of possible Z-scores.</bold>
Decreasing the Z-score threshold increases the number of markers identified as significant.</p>
</caption>
<graphic xlink:href="2043-9113-4-13-2"></graphic>
</fig>
</sec>
<sec>
<title>Verification of relationships</title>
<p>One possible method of verification is to remove ‘verification documents’ (ones specifically pertaining to a disease-protein relationship) from the abstract pool and use them for subsequent verification [
<xref ref-type="bibr" rid="B16">16</xref>
]. Our method allows these abstracts to remain in the pool, and verification is performed by comparing our results to a combined disease biomarker list (Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
: Table S1 &amp; Additional file
<xref ref-type="supplementary-material" rid="S7">7</xref>
: Table S2). The list was created using the following sources: OMIM [
<xref ref-type="bibr" rid="B29">29</xref>
] (O in table;
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/omim/">http://www.ncbi.nlm.nih.gov/omim/</ext-link>
), a cancer gene annotation system for cancer genomics [
<xref ref-type="bibr" rid="B32">32</xref>
] (CAGE(C);
<ext-link ext-link-type="uri" xlink:href="http://mgrc.kribb.re.kr/cage/pageHome.php?m=hm">http://mgrc.kribb.re.kr/cage/pageHome.php?m=hm</ext-link>
), NCBI’s Genes &amp; Disease [
<xref ref-type="bibr" rid="B33">33</xref>
] ((G);
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK22183/">http://www.ncbi.nlm.nih.gov/books/NBK22183/</ext-link>
), NCI’s Early Detection Research Network [
<xref ref-type="bibr" rid="B34">34</xref>
] (EDRN (E);
<ext-link ext-link-type="uri" xlink:href="http://edrn.nci.nih.gov/">http://edrn.nci.nih.gov/</ext-link>
), an expert-provided list (X) of validated cancer markers [
<xref ref-type="bibr" rid="B35">35</xref>
], and a recently released breast cancer paper [
<xref ref-type="bibr" rid="B36">36</xref>
] (P). Markers that are present in at least one of these lists, as well as in our dictionary, were considered verified. The list for breast cancer was compiled using OMIM, CAGE, Genes &amp; Disease, the expert-provided list, and the previously mentioned paper. The lung cancer list was compiled from OMIM, CAGE, EDRN, and the expert-provided list.</p>
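<p>The verification rule stated above (present in at least one known list and in the dictionary) can be sketched as follows. The function name and data structures are hypothetical; the single-letter source codes follow those given in the text.</p>

```python
def verify_markers(significant_markers, known_lists, dictionary_symbols):
    """Return {marker: sorted source codes} for markers found in any list.

    known_lists: {source_code: set of gene symbols}, with codes such as
    O, C, G, E, X and P as used in the text. A marker is verified when it
    is in at least one list and also in the dictionary.
    """
    verified = {}
    for marker in significant_markers:
        if marker not in dictionary_symbols:
            continue
        sources = sorted(code for code, symbols in known_lists.items()
                         if marker in symbols)
        if sources:
            verified[marker] = sources
    return verified
```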
</sec>
<sec>
<title>True positive rate determination</title>
<p>Negative abstracts were utilized to initially eliminate some false positives. However, this process alone is unlikely to eliminate all false positives.</p>
<p>In processing the abstracts, it became apparent that manual examination of abstracts would eventually be required to verify the results. Each biomarker was stored together with the PubMed identifiers of all abstracts mentioning it, allowing manual tracking and further verification of our results. Relevant abstracts were investigated further, and three criteria were used for a pass/fail outcome: the abstract had to mention the biomarker, the disease, and the biofluid. All three criteria had to be met, and synonyms and/or root words were deemed adequate (e.g. biliary instead of bile).</p>
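<p>The examination in this work was manual, but the three-criteria rule can be illustrated with a short sketch. The function names are hypothetical, the plain case-insensitive substring matching is an assumption, and computing the rate as the fraction of checked abstracts that pass is likewise an assumption of the sketch.</p>

```python
def passes_manual_criteria(abstract_text, marker_terms, disease_terms,
                           biofluid_terms):
    """True only if the abstract mentions a marker, disease and biofluid term.

    Each *_terms argument is a list of acceptable strings, including synonyms
    or root words (e.g. "biliary" for bile). Matching is a plain
    case-insensitive substring test, which is an assumption of this sketch.
    """
    text = abstract_text.lower()

    def mentions(terms):
        return any(term.lower() in text for term in terms)

    return all(mentions(terms)
               for terms in (marker_terms, disease_terms, biofluid_terms))


def true_positive_rate(outcomes):
    """Fraction of checked abstracts that passed (0.0 if none were checked)."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```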
</sec>
</sec>
<sec sec-type="results">
<title>Results</title>
<sec>
<title>Positive and negative sets</title>
<p>Table 
<xref ref-type="table" rid="T1">1</xref>
describes the number of relevant abstracts obtained from the PubMed searches. Fourteen biofluids were evaluated. Blood, plasma, and serum returned the most positive and negative abstracts for both the breast and lung cancer queries. Over five million total abstracts were examined.</p>
</sec>
<sec>
<title>Known markers per biofluid</title>
<p>Our known marker lists are combinations of several ‘biomarker lists’ obtained from well-known databases. The known breast cancer marker list contains 211 gene symbols that mapped to our dictionary (Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
: Table S1; 159 found in this exercise), and the known lung cancer marker list has 209 markers that mapped to our dictionary (Additional file
<xref ref-type="supplementary-material" rid="S7">7</xref>
: Table S2; 145 found in this exercise). Known marker results presented in Table 
<xref ref-type="table" rid="T2">2</xref>
were obtained by identifying putative biomarkers with a z-score exceeding the significance threshold (>1.0), and confirming the gene symbol in our known disease biomarker list. Table 
<xref ref-type="table" rid="T2">2</xref>
also summarizes the biofluids that produced markers with significant z-scores and/or the number of known markers found for breast and lung cancer.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Number of markers identified for each disease-biofluid combination</p>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="top">
<tr>
<th align="left">
<bold>Breast Cancer</bold>
</th>
<th align="center">
<bold>Total number of markers found</bold>
</th>
<th align="center">
<bold>Known markers found (211 possible)</bold>
</th>
<th align="center">
<bold>Markers producing a significant z-score (>1.0)</bold>
</th>
<th align="center">
<bold>Known markers with a significant z-score</bold>
</th>
<th align="center">
<bold>New markers with a significant z-score</bold>
</th>
<th align="center">
<bold>% new discoveries</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="bottom">Bile
<hr></hr>
</td>
<td align="center" valign="bottom">200
<hr></hr>
</td>
<td align="center" valign="bottom">26
<hr></hr>
</td>
<td align="center" valign="bottom">58
<hr></hr>
</td>
<td align="center" valign="bottom">7
<hr></hr>
</td>
<td align="center" valign="bottom">51
<hr></hr>
</td>
<td align="center" valign="bottom">87.93
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Blood
<hr></hr>
</td>
<td align="center" valign="bottom">2084
<hr></hr>
</td>
<td align="center" valign="bottom">150
<hr></hr>
</td>
<td align="center" valign="bottom">196
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">187
<hr></hr>
</td>
<td align="center" valign="bottom">95.41
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Breastmilk
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
<td align="center" valign="bottom"> 
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">CSF
<hr></hr>
</td>
<td align="center" valign="bottom">116
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">18
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">18
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Mucus
<hr></hr>
</td>
<td align="center" valign="bottom">63
<hr></hr>
</td>
<td align="center" valign="bottom">13
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">62.50
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Plasma
<hr></hr>
</td>
<td align="center" valign="bottom">1002
<hr></hr>
</td>
<td align="center" valign="bottom">88
<hr></hr>
</td>
<td align="center" valign="bottom">100
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">95
<hr></hr>
</td>
<td align="center" valign="bottom">95.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Saliva
<hr></hr>
</td>
<td align="center" valign="bottom">73
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">80.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Semen
<hr></hr>
</td>
<td align="center" valign="bottom">35
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">6
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">6
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Serum
<hr></hr>
</td>
<td align="center" valign="bottom">1327
<hr></hr>
</td>
<td align="center" valign="bottom">106
<hr></hr>
</td>
<td align="center" valign="bottom">145
<hr></hr>
</td>
<td align="center" valign="bottom">6
<hr></hr>
</td>
<td align="center" valign="bottom">139
<hr></hr>
</td>
<td align="center" valign="bottom">95.86
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">SF
<hr></hr>
</td>
<td align="center" valign="bottom">21
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Stool
<hr></hr>
</td>
<td align="center" valign="bottom">68
<hr></hr>
</td>
<td align="center" valign="bottom">8
<hr></hr>
</td>
<td align="center" valign="bottom">7
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">57.14
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Sweat
<hr></hr>
</td>
<td align="center" valign="bottom">123
<hr></hr>
</td>
<td align="center" valign="bottom">15
<hr></hr>
</td>
<td align="center" valign="bottom">28
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">25
<hr></hr>
</td>
<td align="center" valign="bottom">89.29
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Tears
<hr></hr>
</td>
<td align="center" valign="bottom">26
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Urine
<hr></hr>
</td>
<td align="center" valign="bottom">310
<hr></hr>
</td>
<td align="center" valign="bottom">32
<hr></hr>
</td>
<td align="center" valign="bottom">38
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">35
<hr></hr>
</td>
<td align="center" valign="bottom">92.11
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">
<bold>Lung Cancer</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">
<bold>Total number of markers found</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">
<bold>Known markers found (209 possible)</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">
<bold>Markers producing a significant z-score (>1.0)</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">
<bold>Known markers with a significant z-score</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">
<bold>New markers with a significant z-score</bold>
<hr></hr>
</td>
<td align="center" valign="bottom">
<bold>% new discoveries</bold>
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Bile
<hr></hr>
</td>
<td align="center" valign="bottom">167
<hr></hr>
</td>
<td align="center" valign="bottom">17
<hr></hr>
</td>
<td align="center" valign="bottom">25
<hr></hr>
</td>
<td align="center" valign="bottom">1
<hr></hr>
</td>
<td align="center" valign="bottom">24
<hr></hr>
</td>
<td align="center" valign="bottom">96.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Blood
<hr></hr>
</td>
<td align="center" valign="bottom">1863
<hr></hr>
</td>
<td align="center" valign="bottom">141
<hr></hr>
</td>
<td align="center" valign="bottom">152
<hr></hr>
</td>
<td align="center" valign="bottom">7
<hr></hr>
</td>
<td align="center" valign="bottom">145
<hr></hr>
</td>
<td align="center" valign="bottom">95.39
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Breastmilk
<hr></hr>
</td>
<td align="center" valign="bottom">77
<hr></hr>
</td>
<td align="center" valign="bottom">15
<hr></hr>
</td>
<td align="center" valign="bottom">11
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">9
<hr></hr>
</td>
<td align="center" valign="bottom">81.82
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">CSF
<hr></hr>
</td>
<td align="center" valign="bottom">106
<hr></hr>
</td>
<td align="center" valign="bottom">7
<hr></hr>
</td>
<td align="center" valign="bottom">11
<hr></hr>
</td>
<td align="center" valign="bottom">1
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">90.91
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Mucus
<hr></hr>
</td>
<td align="center" valign="bottom">276
<hr></hr>
</td>
<td align="center" valign="bottom">27
<hr></hr>
</td>
<td align="center" valign="bottom">73
<hr></hr>
</td>
<td align="center" valign="bottom">10
<hr></hr>
</td>
<td align="center" valign="bottom">63
<hr></hr>
</td>
<td align="center" valign="bottom">86.30
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Plasma
<hr></hr>
</td>
<td align="center" valign="bottom">843
<hr></hr>
</td>
<td align="center" valign="bottom">75
<hr></hr>
</td>
<td align="center" valign="bottom">65
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">61
<hr></hr>
</td>
<td align="center" valign="bottom">93.85
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Saliva
<hr></hr>
</td>
<td align="center" valign="bottom">53
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">7
<hr></hr>
</td>
<td align="center" valign="bottom">1
<hr></hr>
</td>
<td align="center" valign="bottom">6
<hr></hr>
</td>
<td align="center" valign="bottom">85.71
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Semen
<hr></hr>
</td>
<td align="center" valign="bottom">11
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Serum
<hr></hr>
</td>
<td align="center" valign="bottom">1109
<hr></hr>
</td>
<td align="center" valign="bottom">100
<hr></hr>
</td>
<td align="center" valign="bottom">103
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">100
<hr></hr>
</td>
<td align="center" valign="bottom">97.09
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">SF
<hr></hr>
</td>
<td align="center" valign="bottom">13
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">3
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Stool
<hr></hr>
</td>
<td align="center" valign="bottom">45
<hr></hr>
</td>
<td align="center" valign="bottom">2
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Sweat
<hr></hr>
</td>
<td align="center" valign="bottom">44
<hr></hr>
</td>
<td align="center" valign="bottom">5
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">4
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left" valign="bottom">Tears
<hr></hr>
</td>
<td align="center" valign="bottom">12
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">1
<hr></hr>
</td>
<td align="center" valign="bottom">0
<hr></hr>
</td>
<td align="center" valign="bottom">1
<hr></hr>
</td>
<td align="center" valign="bottom">100.00
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Urine</td>
<td align="center">256</td>
<td align="center">30</td>
<td align="center">56</td>
<td align="center">6</td>
<td align="center">50</td>
<td align="center">89.29</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Known markers were determined by identification of the given gene symbol in our known biomarker lists (Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
: Table S1 or Additional file
<xref ref-type="supplementary-material" rid="S7">7</xref>
: Table S2). Significant markers had a z-score >1.0.</p>
</table-wrap-foot>
</table-wrap>
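<p>The last column of Table 2 can be recomputed from the two preceding counts. The sketch below assumes that "% new discoveries" is the share of significant markers that are absent from the known marker list; the function name percent_new is hypothetical, and the counts are copied from the breast cancer rows of the table.</p>
<preformat>
```python
def percent_new(significant, known_significant):
    """Percentage of significant markers that are not in the known marker list."""
    if significant == 0:
        return None  # undefined when no marker reaches significance
    new_significant = significant - known_significant
    return round(100.0 * new_significant / significant, 2)


# (markers with significant z-score, known markers with significant z-score)
rows = {"Bile": (58, 7), "Blood": (196, 9), "Stool": (7, 3)}
percentages = {fluid: percent_new(*counts) for fluid, counts in rows.items()}
print(percentages)
```
</preformat>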
</sec>
<sec>
<title>Z-score threshold optimization</title>
<p>We chose the z-score threshold empirically. Figure 
<xref ref-type="fig" rid="F2">2</xref>
plots the number of known markers and new markers (log
<sub>10</sub>
) against the z-score threshold, which was varied between 1 and 4 in increments of 0.5. Based on this plot, we chose a non-stringent z-score threshold of 1.0, which allows us to identify the maximum number of known and new markers.</p>
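<p>A threshold sweep of this kind might look like the following Python sketch. It is illustrative only: the function name sweep_thresholds, the marker names MARKER_A to MARKER_D, and their z-scores are placeholders rather than values from the study, and significance is taken to mean a z-score strictly greater than the threshold.</p>
<preformat>
```python
def sweep_thresholds(z_scores, known_symbols, start=1.0, stop=4.0, step=0.5):
    """Count known and new markers whose z-score exceeds each threshold."""
    counts = []
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        threshold = start + i * step
        passing = [symbol for symbol, z in z_scores.items() if z > threshold]
        known = sum(1 for symbol in passing if symbol in known_symbols)
        counts.append((threshold, known, len(passing) - known))
    return counts


# Placeholder data (assumptions, not study results).
z_scores = {"MARKER_A": 4.2, "MARKER_B": 2.6, "MARKER_C": 1.2, "MARKER_D": 0.4}
known_symbols = {"MARKER_A", "MARKER_C"}
sweep = sweep_thresholds(z_scores, known_symbols)
for threshold, known, new in sweep:
    print(threshold, known, new)
```
</preformat>
<p>Because raising the threshold can only remove markers, the lowest threshold in the sweep always yields the largest counts, which is consistent with the choice of 1.0.</p>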
</sec>
<sec>
<title>Comparison of identification of potential biomarkers by disease-biofluid</title>
<p>Table 
<xref ref-type="table" rid="T2">2</xref>
shows the breakdown of the number of markers found by our method. For most biofluids, more markers were found for breast cancer than for lung cancer; the exceptions are breastmilk (removed from our breast cancer examination because both the positive and negative search terms contain the root ‘breast’) and mucus (which has a greater association with the respiratory system).</p>
</sec>
<sec>
<title>Known markers found significant vs. non-significant</title>
<p>Because the complete set of breast or lung cancer biomarkers is unknown, a true positive value cannot be obtained, but estimates can be made. Although these numbers are not shown, the percentage of known markers identified as significant vs. not significant can be calculated from the counts in Table 
<xref ref-type="table" rid="T2">2</xref>
.</p>
<p>For breast cancer, the percentages range from 5% in plasma and serum to 37.5% in stool (considering only biofluids with a non-zero number of known significant markers). For lung cancer, the range is from 3% in serum to 37% in mucus.</p>
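<p>These percentages can be derived from two columns of Table 2. The sketch below assumes the ratio is the number of known markers with a significant z-score divided by the number of known markers found; the function name percent_known_significant is hypothetical, and the counts are copied from the table.</p>
<preformat>
```python
def percent_known_significant(known_found, known_significant):
    """Percentage of the known markers found that also reach significance."""
    if known_found == 0:
        return None  # no known markers found for this biofluid
    return round(100.0 * known_significant / known_found, 2)


# (known markers found, known markers with significant z-score) from Table 2
rows = {
    ("breast", "stool"): (8, 3),
    ("breast", "serum"): (106, 6),
    ("lung", "serum"): (100, 3),
    ("lung", "mucus"): (27, 10),
}
known_pct = {key: percent_known_significant(*counts) for key, counts in rows.items()}
print(known_pct)
```
</preformat>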
</sec>
<sec>
<title>Newly discovered markers found significant vs. non-significant</title>
<p>The percentage of newly discovered markers (markers not found in known marker list) that were found to be significant vs. the percentage that were identified but not found to be significant was calculated.</p>
<p>For breast cancer, percentages range from 6.67% in stool to 29.3% in bile (for biofluids with known-significant markers; non-zero). In lung cancer the range is from 7.9% in plasma to 27.2% in synovial fluid.</p>
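<p>The text does not state the formula behind these values. The Python sketch below assumes that the denominator is the number of markers found that are not in the known list (total markers found minus known markers found); under that assumption the Table 2 counts give values that agree with the reported range endpoints. The function name percent_new_significant is hypothetical.</p>
<preformat>
```python
def percent_new_significant(total_found, known_found, new_significant):
    """Share of newly discovered markers (found, not in the known list) that are significant."""
    new_found = total_found - known_found
    if new_found == 0:
        return None
    return round(100.0 * new_significant / new_found, 2)


# (total markers found, known markers found, new markers with significant z-score)
rows = {
    ("breast", "stool"): (68, 8, 4),
    ("breast", "bile"): (200, 26, 51),
    ("lung", "plasma"): (843, 75, 61),
    ("lung", "synovial fluid"): (13, 2, 3),
}
new_pct = {key: percent_new_significant(*counts) for key, counts in rows.items()}
print(new_pct)
```
</preformat>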
</sec>
<sec>
<title>Potential marker biofluid specificity</title>
<p>Biomarker commonality and specificity were sought across biofluids. This was a significant finding in that we have not seen many comparisons of potential biomarkers across more than a few biofluids. Additional file
<xref ref-type="supplementary-material" rid="S8">8</xref>
: Table S3 shows the known + significant biomarkers within biofluids for breast and lung cancer.</p>
<p>A total of 21 known + significant markers were identified for breast cancer. Nine biofluids produced known IDs with significant scores. A breakdown of this list shows that 14 are identified in combination with only one biofluid, 3 with two biofluids, 1 with 3 biofluids (ERBB2; mentioned in blood, plasma, and serum), 1 with 4 biofluids (NCOA3; mentioned in bile, blood, plasma, and serum), 1 with 6 biofluids (BRCA2; mentioned in bile, blood, mucus, saliva, serum, and sweat), and 1 with 7 biofluids (BRCA1; mentioned in blood, mucus, plasma, saliva, serum, sweat, and urine abstracts).</p>
<p>A total of 26 known + significant putative markers were identified for lung cancer. Eight biofluids produced known IDs with significant scores. A breakdown of this list shows that 21 are mentioned in combination with only one biofluid, 3 with two biofluids, 1 with 3 biofluids (EML4; mentioned in blood, mucus, and serum), and 1 with 4 biofluids (KRAS; mentioned in blood, breastmilk, mucus, and serum).</p>
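<p>A breakdown of this kind amounts to counting, for each marker, the number of biofluids in which it is significant. The Python sketch below illustrates the idea using only the four multi-biofluid breast cancer markers named above; the remaining markers are listed in the supplementary table and are not reproduced here, and the variable names are hypothetical.</p>
<preformat>
```python
from collections import Counter

# Marker-to-biofluid sets for the breast cancer markers named in the text.
marker_biofluids = {
    "ERBB2": {"blood", "plasma", "serum"},
    "NCOA3": {"bile", "blood", "plasma", "serum"},
    "BRCA2": {"bile", "blood", "mucus", "saliva", "serum", "sweat"},
    "BRCA1": {"blood", "mucus", "plasma", "saliva", "serum", "sweat", "urine"},
}

# Number of markers observed in exactly n biofluids.
breakdown = Counter(len(fluids) for fluids in marker_biofluids.values())
print(sorted(breakdown.items()))
```
</preformat>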
</sec>
<sec>
<title>Manual verification of findings</title>
<p>A manual check of relevant abstracts was performed to ensure the reliability of our results. Each relevant PubMed abstract was manually examined to verify the biomarker mentioned. The results of this manual verification can be seen in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Table S4. Four known biomarkers (CHEK2 in both plasma and urine, CDKN1B, PCNA, and THBS1) were identified as false positives (red) in our breast cancer list, and seven (KRAS, GDNF in both breastmilk and plasma, MYCL1 in both blood and serum, CD40LG, CGA, CTAG1A, ERCC6, and HRAS) in our lung cancer list. KRAS is interesting in that it produced a false positive in association with breastmilk, but had verified positive findings in associations with blood, mucus, and serum.</p>
</sec>
<sec>
<title>True positive rate estimation of new discoveries</title>
<p>Manual verification allowed us to estimate the true positive rates across the disease-biofluid combinations. The results in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
: Table S4 show an average error rate for breast cancer of 12.5%, and an average lung cancer error rate of 29.41%. From these calculations, one can conclude that 87.5% of the breast cancer new discoveries would be true positives, and 70.59% of the lung cancer new discoveries would be true positives.</p>
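<p>The conversion from error rate to true positive rate is simple arithmetic, sketched below in Python. The function name true_positive_rate is hypothetical; the two error rates are the averages reported above.</p>
<preformat>
```python
def true_positive_rate(error_rate_percent):
    """Complement of the manually estimated error rate, in percent."""
    return round(100.0 - error_rate_percent, 2)


breast_tpr = true_positive_rate(12.5)
lung_tpr = true_positive_rate(29.41)
print(breast_tpr, lung_tpr)
```
</preformat>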
</sec>
</sec>
<sec sec-type="discussion">
<title>Discussion</title>
<p>We have presented a method to determine the possibility of relatedness between potential biomarkers in biofluids and disease (breast and lung cancers), using positive and negative sets of abstracts and a z-score.</p>
<p>Error exists in ABNER’s [
<xref ref-type="bibr" rid="B31">31</xref>
] tagging, our dictionary consensus, and possibly anywhere manual processing of the data occurs. Negation was not addressed at this time.</p>
<p>A potential dictionary problem was identified in that some members of a protein family had a generic alias in common. This led to results such as ceacam5 and ceacam8 both being identified for the CEA alias. Adding another unique ID such as “ceacam_family” to account for this double counting was considered; however, it was decided to let the counts stand, as there may be double counting elsewhere in the dictionary of which we are unaware.</p>
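<p>The double counting can be illustrated with a small Python sketch. The dictionary layout (alias mapped to a list of identifiers) and the function name count_mentions are assumptions made for illustration; the actual structure of the dictionary used in this work is not described here.</p>
<preformat>
```python
from collections import Counter

# Hypothetical alias dictionary: the generic alias "cea" is shared by two family members.
alias_to_ids = {
    "cea": ["ceacam5", "ceacam8"],
    "ceacam5": ["ceacam5"],
    "ceacam8": ["ceacam8"],
}


def count_mentions(tagged_mentions, alias_to_ids):
    """Credit every identifier that shares the tagged alias with one mention."""
    counts = Counter()
    for mention in tagged_mentions:
        for gene_id in alias_to_ids.get(mention.lower(), []):
            counts[gene_id] += 1
    return counts


counts = count_mentions(["CEA"], alias_to_ids)
print(dict(counts))
```
</preformat>
<p>A single tagged mention of CEA is credited to both identifiers, which is the behaviour that was left in place.</p>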
<p>In some situations, a potential biomarker may need to be mentioned in only one negative-set abstract to be deemed non-significant by our method. Because disease-specific potential markers are sought, common biomarkers implicated in several diseases may not reach a significant score by our method, since they are mentioned in abstracts describing other diseases, including other types of cancer.</p>
<p>A requirement for potential biomarkers to appear in different abstracts was not applied. Several biomarker mentions may come from the same abstract. Similarly, there was not a requirement for different biofluids to appear in different abstracts. One biomarker discussed in association with more than one biofluid may appear in the list for each biofluid.</p>
<p>The number of known cancer biomarkers found but deemed not significant was reported. The results may be due to the way the negative search space was defined. It is possible that abstracts of other cancers or diseases exist in our negative set, and thus any biomarker mentioned in association with any other disease would negate our positive findings for breast and/or lung cancer.</p>
<p>The databases used for verification are probably far from complete, which could be why our list of known + significant biomarkers is smaller than expected. Another explanation could be that certain markers may simply not be present in a given biofluid. We will work to improve our verification methods over time.</p>
<p>Lastly, only abstracts were examined in this work. Full-text examination would likely produce more findings as well as more confidence in the findings, but access to full text remains a limiting factor for all text-mining researchers.</p>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>We have presented a method that uses literature mining to create a list of documented putative biomarker-biofluid relationships for breast and lung cancer. Over 5 million abstracts were analyzed for biomarker-disease associations. These abstract sets were further stratified among 14 biofluids. Some false positives were initially eliminated by examining negative sets of abstracts and establishing a threshold z-score. New knowledge pertaining to breast and lung cancer is presented in the form of known disease biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids. The relationships obtained from literature mining were verified by comparison to well-known published databases. Manual examination of abstracts allowed for known relationship verification and true positive rate calculations. On average, we can expect an 87.5% true positive rate for our breast cancer new discoveries, and a 70.59% true positive rate for our lung cancer new discoveries.</p>
<p>Future work in this area will include further automation of our semi-automated process, applying our method to other diseases, assembling a disease database to make our z-score findings available to others, as well as converting our z-score values into prior probabilities for use as informative priors in Bayesian disease modeling.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors’ contributions</title>
<p>RJ wrote the Python scripts, downloaded abstracts, performed the analysis, and created the figures and tables. VG conceived of the study and participated in its design and coordination. SV provided methodology and participated in study design. All authors participated in drafting the manuscript as well as reading and approving the final manuscript.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1: Table S4</title>
<p>Manually verified biomarker table. Biomarker specific abstracts were manually examined for accuracy. Abstracts were examined for mentions of biofluid, disease, and biomarker. Lack of any one term resulted in a ‘false positive’ result.</p>
</caption>
<media xlink:href="2043-9113-4-13-S1.docx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional file 2</title>
<p>SupplementaryBiofluidTable.</p>
</caption>
<media xlink:href="2043-9113-4-13-S2.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S3">
<caption>
<title>Additional file 3</title>
<p>SupplementaryProteinlist.</p>
</caption>
<media xlink:href="2043-9113-4-13-S3.txt">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S4">
<caption>
<title>Additional file 4</title>
<p>SupplementaryBCResults.</p>
</caption>
<media xlink:href="2043-9113-4-13-S4.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S5">
<caption>
<title>Additional file 5</title>
<p>SupplementaryLCResults.</p>
</caption>
<media xlink:href="2043-9113-4-13-S5.xlsx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S6">
<caption>
<title>Additional file 6: Table S1</title>
<p>List of breast cancer identifiers.</p>
</caption>
<media xlink:href="2043-9113-4-13-S6.docx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S7">
<caption>
<title>Additional file 7: Table S2</title>
<p>List of lung cancer identifiers.</p>
</caption>
<media xlink:href="2043-9113-4-13-S7.docx">
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S8">
<caption>
<title>Additional file 8: Table S3</title>
<p>
<bold>Identification of the significant validated potential markers found to be in common to several biofluids or biofluid specific for breast and lung cancer.</bold>
Biomarkers highlighted in yellow are either breast cancer markers found in the list of validated lung cancer biomarkers (Additional file
<xref ref-type="supplementary-material" rid="S7">7</xref>
: Table S2), or lung cancer markers found in the list of validated breast cancer biomarkers (Additional file
<xref ref-type="supplementary-material" rid="S6">6</xref>
: Table S1). It is doubtful that these markers are disease specific. CDH1 is the only biomarker found in both cancer lists.</p>
</caption>
<media xlink:href="2043-9113-4-13-S8.docx">
</media>
</supplementary-material>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>The research reported in this publication was partially supported by the following grants from the National Institutes of Health: National Library of Medicine Award Number R01LM010950 (to VG), and National Institute of General Medical Sciences Award Number R01GM100387 (to VG) and National Cancer Institute Award Number P50CA090440. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Hirschman</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Tsujii</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>CH</given-names>
</name>
<article-title>Accomplishments and challenges in literature data mining for biology</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>1553</fpage>
<lpage>1561</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.12.1553</pub-id>
<pub-id pub-id-type="pmid">12490438</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Adamic</surname>
<given-names>LA</given-names>
</name>
<name>
<surname>Wilkinson</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Huberman</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Adar</surname>
<given-names>E</given-names>
</name>
<article-title>A literature based method for identifying gene-disease connections</article-title>
<source>Proc IEEE Comput Soc Bioinform Conf</source>
<year>2002</year>
<volume>1</volume>
<fpage>109</fpage>
<lpage>117</lpage>
<pub-id pub-id-type="pmid">15838128</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Wren</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Bekeredjian</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Shohet</surname>
<given-names>RV</given-names>
</name>
<name>
<surname>Garner</surname>
<given-names>HR</given-names>
</name>
<article-title>Knowledge discovery by automated identification and ranking of implicit relationships</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>389</fpage>
<lpage>398</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg421</pub-id>
<pub-id pub-id-type="pmid">14960466</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Xuan</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Watson</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>F</given-names>
</name>
<article-title>Medline search engine for finding genetic markers with biological significance</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>2477</fpage>
<lpage>2484</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm375</pub-id>
<pub-id pub-id-type="pmid">17823133</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Hristovski</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Peterlin</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Humphrey</surname>
<given-names>SM</given-names>
</name>
<article-title>Using literature-based discovery to identify disease candidate genes</article-title>
<source>Int J Med Inform</source>
<year>2005</year>
<volume>74</volume>
<fpage>289</fpage>
<lpage>298</lpage>
<pub-id pub-id-type="doi">10.1016/j.ijmedinf.2004.04.024</pub-id>
<pub-id pub-id-type="pmid">15694635</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Novichkova</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Egorov</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Daraselia</surname>
<given-names>N</given-names>
</name>
<article-title>MedScan, a natural language processing engine for MEDLINE abstracts</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>1699</fpage>
<lpage>1706</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg207</pub-id>
<pub-id pub-id-type="pmid">12967967</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Srinivasan</surname>
<given-names>P</given-names>
</name>
<article-title>Text mining: generating hypotheses from MEDLINE</article-title>
<source>J Am Soc Inform Sci Technol</source>
<year>2004</year>
<volume>55</volume>
<fpage>396</fpage>
<lpage>413</lpage>
<pub-id pub-id-type="doi">10.1002/asi.10389</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Leonard</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Colombe</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>JL</given-names>
</name>
<article-title>Finding relevant references to genes and proteins in Medline using a Bayesian approach</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<fpage>1515</fpage>
<lpage>1522</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.11.1515</pub-id>
<pub-id pub-id-type="pmid">12424124</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Saric</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
<article-title>Literature mining for the biologist: from information retrieval to biological discovery</article-title>
<source>Nat Rev Genet</source>
<year>2006</year>
<volume>7</volume>
<fpage>119</fpage>
<lpage>129</lpage>
<pub-id pub-id-type="doi">10.1038/nrg1768</pub-id>
<pub-id pub-id-type="pmid">16418747</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Krallinger</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Valencia</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hirschman</surname>
<given-names>L</given-names>
</name>
<article-title>Linking genes to literature: text mining, information extraction, and retrieval applications for biology</article-title>
<source>Genome Biol</source>
<year>2008</year>
<volume>9</volume>
<issue>Suppl 2</issue>
<fpage>S8</fpage>
<pub-id pub-id-type="pmid">18834499</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Cohen</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Hersh</surname>
<given-names>WR</given-names>
</name>
<article-title>A survey of current work in biomedical text mining</article-title>
<source>Brief Bioinform</source>
<year>2005</year>
<volume>6</volume>
<fpage>57</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="doi">10.1093/bib/6.1.57</pub-id>
<pub-id pub-id-type="pmid">15826357</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>Swanson</surname>
<given-names>DR</given-names>
</name>
<article-title>Medical literature as a potential source of new knowledge</article-title>
<source>Bull Med Libr Assoc</source>
<year>1990</year>
<volume>78</volume>
<fpage>29</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="pmid">2403828</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Zhu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Okuno</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tsujimoto</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mamitsuka</surname>
<given-names>H</given-names>
</name>
<article-title>Application of a new probabilistic model for mining implicit associated cancer genes from OMIM and Medline</article-title>
<source>Cancer Inform</source>
<year>2006</year>
<volume>2</volume>
<fpage>361</fpage>
<lpage>371</lpage>
<pub-id pub-id-type="pmid">19458778</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Frijters</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Van Vugt</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Smeets</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Van Schaik</surname>
<given-names>R</given-names>
</name>
<name>
<surname>De Vlieg</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Alkema</surname>
<given-names>W</given-names>
</name>
<article-title>Literature mining for the discovery of hidden connections between drugs, genes and diseases</article-title>
<source>PLoS Comput Biol</source>
<year>2010</year>
<volume>6</volume>
<fpage>e1000943</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1000943</pub-id>
<pub-id pub-id-type="pmid">20885778</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C</given-names>
</name>
<article-title>Biomarker identification using text mining</article-title>
<source>Comput Math Methods Med</source>
<year>2012</year>
<volume>2012</volume>
<fpage>135780</fpage>
<pub-id pub-id-type="pmid">23197989</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Al-Mubaid</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>RK</given-names>
</name>
<article-title>A new text mining approach for finding protein-to-disease associations</article-title>
<source>Am J Biochem Biotechnol</source>
<year>2005</year>
<volume>1</volume>
<fpage>145</fpage>
<lpage>152</lpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Andrade</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Valencia</surname>
<given-names>A</given-names>
</name>
<article-title>Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families</article-title>
<source>Bioinformatics</source>
<year>1998</year>
<volume>14</volume>
<fpage>600</fpage>
<lpage>607</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/14.7.600</pub-id>
<pub-id pub-id-type="pmid">9730925</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Younesi</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Toldo</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Muller</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Friedrich</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Novac</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Scheer</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hofmann-Apitius</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fluck</surname>
<given-names>J</given-names>
</name>
<article-title>Mining biomarker information in biomedical literature</article-title>
<source>BMC Med Inform Decis Mak</source>
<year>2012</year>
<volume>12</volume>
<fpage>148</fpage>
<pub-id pub-id-type="doi">10.1186/1472-6947-12-148</pub-id>
<pub-id pub-id-type="pmid">23249606</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Deyati</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Younesi</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hofmann-Apitius</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Novac</surname>
<given-names>N</given-names>
</name>
<article-title>Challenges and opportunities for oncology biomarker discovery</article-title>
<source>Drug Discov Today</source>
<year>2012</year>
<volume>18</volume>
<fpage>614</fpage>
<lpage>624</lpage>
<pub-id pub-id-type="pmid">23280501</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Veenstra</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Conrads</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Hood</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Avellino</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ellenbogen</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Morrison</surname>
<given-names>R</given-names>
</name>
<article-title>Biomarkers: mining the biofluid proteome</article-title>
<source>Mol Cell Proteomics</source>
<year>2005</year>
<volume>4</volume>
<fpage>409</fpage>
<lpage>418</lpage>
<pub-id pub-id-type="doi">10.1074/mcp.M500006-MCP200</pub-id>
<pub-id pub-id-type="pmid">15684407</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Zhou</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Conrads</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Veenstra</surname>
<given-names>T</given-names>
</name>
<article-title>Proteomics approaches to biomarker detection</article-title>
<source>Brief Funct Genom Proteomics</source>
<year>2005</year>
<volume>4</volume>
<fpage>69</fpage>
<lpage>75</lpage>
<pub-id pub-id-type="doi">10.1093/bfgp/4.1.69</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Lee</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>D</given-names>
</name>
<article-title>Saliva: An emerging biofluid for early detection of diseases</article-title>
<source>Am J Dent</source>
<year>2009</year>
<volume>22</volume>
<fpage>241</fpage>
<lpage>248</lpage>
<pub-id pub-id-type="pmid">19824562</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Gao</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wolinsky</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Farrell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Eibl</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>D</given-names>
</name>
<article-title>Systemic disease-induced salivary biomarker profiles in mouse models of melanoma and non-small cell lung cancer</article-title>
<source>PLoS One</source>
<year>2009</year>
<volume>4</volume>
<fpage>e5875</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0005875</pub-id>
<pub-id pub-id-type="pmid">19517020</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Xu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Veenstra</surname>
<given-names>T</given-names>
</name>
<article-title>Analysis of biofluids for biomarker research</article-title>
<source>Proteomics Clin Appl</source>
<year>2008</year>
<volume>2</volume>
<fpage>1403</fpage>
<lpage>1412</lpage>
<pub-id pub-id-type="doi">10.1002/prca.200780173</pub-id>
<pub-id pub-id-type="pmid">21136789</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Delaleu</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Immervoll</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Cornelius</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jonsson</surname>
<given-names>R</given-names>
</name>
<article-title>Biomarker profiles in serum and saliva of experimental Sjogren’s syndrome: associations with specific autoimmune manifestations</article-title>
<source>Arthritis Res Ther</source>
<year>2008</year>
<volume>10</volume>
<fpage>R22</fpage>
<pub-id pub-id-type="doi">10.1186/ar2375</pub-id>
<pub-id pub-id-type="pmid">18289371</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="other">
<name>
<surname>Alterovitz</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Xiang</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ramoni</surname>
<given-names>MF</given-names>
</name>
<article-title>System-wide peripheral biomarker discovery using information theory</article-title>
<source>Pac Symp Biocomput</source>
<year>2008</year>
<fpage>231</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="pmid">18229689</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Camon</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Magrane</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Barrell</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Dimmer</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Maslen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Binns</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Harte</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Lopez</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<article-title>The Gene Ontology Annotation (GOA) database: sharing knowledge in UniProt with Gene Ontology</article-title>
<source>Nucleic Acids Res</source>
<year>2004</year>
<volume>32</volume>
<issue>Database issue</issue>
<fpage>D262</fpage>
<lpage>D266</lpage>
<pub-id pub-id-type="pmid">14681408</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ball</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Blake</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Botstein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Butler</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Cherry</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>AP</given-names>
</name>
<name>
<surname>Dolinski</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Dwight</surname>
<given-names>SS</given-names>
</name>
<name>
<surname>Eppig</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Hill</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Issel-Tarver</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Kasarskis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Matese</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Richardson</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Ringwald</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Sherlock</surname>
<given-names>G</given-names>
</name>
<article-title>Gene ontology: tool for the unification of biology. The Gene Ontology Consortium</article-title>
<source>Nat Genet</source>
<year>2000</year>
<volume>25</volume>
<issue>1</issue>
<fpage>25</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="doi">10.1038/75556</pub-id>
<pub-id pub-id-type="pmid">10802651</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<name>
<surname>Wheeler</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Barrett</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>SH</given-names>
</name>
<name>
<surname>Canese</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chetvernin</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>DiCuccio</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Edgar</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Federhen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Geer</surname>
<given-names>LY</given-names>
</name>
<name>
<surname>Kapustin</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Khovayko</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Landsman</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Maglott</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Ostell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Pruitt</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Schuler</surname>
<given-names>GD</given-names>
</name>
<name>
<surname>Sequeira</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Sherry</surname>
<given-names>ST</given-names>
</name>
<name>
<surname>Sirotkin</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Souvorov</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Starchenko</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Tatusov</surname>
<given-names>RL</given-names>
</name>
<name>
<surname>Tatusova</surname>
<given-names>TA</given-names>
</name>
<name>
<surname>Wagner</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Yaschenko</surname>
<given-names>E</given-names>
</name>
<article-title>Database resources of the National Center for Biotechnology Information</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>Database issue</issue>
<fpage>D5</fpage>
<lpage>D12</lpage>
<comment>Epub 2006 Dec 14</comment>
<pub-id pub-id-type="pmid">17170002</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Hewett</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Oliver</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Easton</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Stuart</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>RB</given-names>
</name>
<name>
<surname>Klein</surname>
<given-names>TE</given-names>
</name>
<article-title>PharmGKB: the pharmacogenetics knowledge base</article-title>
<source>Nucleic Acids Res</source>
<year>2002</year>
<volume>30</volume>
<issue>1</issue>
<fpage>163</fpage>
<lpage>165</lpage>
<pub-id pub-id-type="doi">10.1093/nar/30.1.163</pub-id>
<pub-id pub-id-type="pmid">11752281</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Settles</surname>
<given-names>B</given-names>
</name>
<article-title>ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>3191</fpage>
<lpage>3192</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti475</pub-id>
<pub-id pub-id-type="pmid">15860559</pub-id>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Park</surname>
<given-names>YK</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>TW</given-names>
</name>
<name>
<surname>Baek</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>KI</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>SY</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>YS</given-names>
</name>
<article-title>CaGe: a web-based cancer gene annotation system for cancer genomics</article-title>
<source>Genom Inform</source>
<year>2012</year>
<volume>10</volume>
<issue>1</issue>
<fpage>33</fpage>
<lpage>39</lpage>
<comment>Epub 2012 Mar 31</comment>
<pub-id pub-id-type="doi">10.5808/GI.2012.10.1.33</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="book">
<collab>National Center for Biotechnology Information (US)</collab>
<source>Genes and Disease [Internet]</source>
<year>1998</year>
<publisher-name>Bethesda (MD): National Center for Biotechnology Information (US)</publisher-name>
<comment>Available from:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK22183/">http://www.ncbi.nlm.nih.gov/books/NBK22183/</ext-link>
</comment>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Wagner</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>Srivastava</surname>
<given-names>S</given-names>
</name>
<article-title>New paradigms in translational science research in cancer biomarkers</article-title>
<source>Transl Res</source>
<year>2012</year>
<volume>159</volume>
<issue>4</issue>
<fpage>343</fpage>
<lpage>353</lpage>
<comment>Epub 2012 Feb 3</comment>
<pub-id pub-id-type="doi">10.1016/j.trsl.2012.01.015</pub-id>
<pub-id pub-id-type="pmid">22424436</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Bigbee</surname>
<given-names>WL</given-names>
</name>
<name>
<surname>Gopalakrishnan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Weissfeld</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>DO</given-names>
</name>
<name>
<surname>Dacic</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lokshin</surname>
<given-names>AE</given-names>
</name>
<name>
<surname>Siegfried</surname>
<given-names>JM</given-names>
</name>
<article-title>A multiplexed serum biomarker immunoassay panel discriminates clinical lung cancer patients from high-risk individuals found to be cancer-free by CT screening</article-title>
<source>J Thorac Oncol</source>
<year>2012</year>
<volume>7</volume>
<issue>4</issue>
<fpage>698</fpage>
<lpage>708</lpage>
<pub-id pub-id-type="doi">10.1097/JTO.0b013e31824ab6b0</pub-id>
<pub-id pub-id-type="pmid">22425918</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="other">
<collab>Cancer Genome Atlas Network</collab>
<article-title>Comprehensive molecular portraits of human breast tumours</article-title>
<source>Nature</source>
<year>2012</year>
<comment>Advance online publication</comment>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000158 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000158 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Informatique
   |area=    SgmlV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4215335
   |texte=   Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:25379168" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a SgmlV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jul 1 14:26:08 2019. Site generation: Wed Apr 28 21:40:44 2021