Mining images in biomedical publications: Detection and analysis of gel diagrams
Authors: Tobias Kuhn [Switzerland]; Mate Levente Nagy [United States]; Thaibinh Luong [United States]; Michael Krauthammer [United States]. Source:
- Journal of Biomedical Semantics [2041-1480]; 2014.
Abstract
Authors of biomedical publications use gel images to report experimental results such as protein-protein interactions or protein expressions under different conditions. Gel images offer a concise way to communicate such findings, not all of which need to be explicitly discussed in the article text. This fact together with the abundance of gel images and their shared common patterns makes them prime candidates for automated image mining and parsing. We introduce an approach for the detection of gel images, and present a workflow to analyze them. We are able to detect gel segments and panels at high accuracy, and present preliminary results for the identification of gene names in these images. While we cannot provide a complete solution at this point, we present evidence that this kind of image mining is feasible.
DOI: 10.1186/2041-1480-5-10
PubMed: 24568573
PubMed Central: 4190668
The document in XML format:
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Mining images in biomedical publications: Detection and analysis of gel
diagrams</title>
<author><name sortKey="Kuhn, Tobias" sort="Kuhn, Tobias" uniqKey="Kuhn T" first="Tobias" last="Kuhn">Tobias Kuhn</name>
<affiliation wicri:level="1"><nlm:aff id="I1">Department of Humanities, Social and Political Sciences, ETH Zurich, Zürich, Switzerland</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea>Department of Humanities, Social and Political Sciences, ETH Zurich, Zürich</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Nagy, Mate Levente" sort="Nagy, Mate Levente" uniqKey="Nagy M" first="Mate Levente" last="Nagy">Mate Levente Nagy</name>
<affiliation wicri:level="1"><nlm:aff id="I3">Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Luong, Thaibinh" sort="Luong, Thaibinh" uniqKey="Luong T" first="Thaibinh" last="Luong">Thaibinh Luong</name>
<affiliation wicri:level="1"><nlm:aff id="I2">Department of Pathology, Yale University School of Medicine, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Pathology, Yale University School of Medicine, New Haven, CT</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Krauthammer, Michael" sort="Krauthammer, Michael" uniqKey="Krauthammer M" first="Michael" last="Krauthammer">Michael Krauthammer</name>
<affiliation wicri:level="1"><nlm:aff id="I2">Department of Pathology, Yale University School of Medicine, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Pathology, Yale University School of Medicine, New Haven, CT</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><nlm:aff id="I3">Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">24568573</idno>
<idno type="pmc">4190668</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4190668</idno>
<idno type="RBID">PMC:4190668</idno>
<idno type="doi">10.1186/2041-1480-5-10</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000109</idno>
<idno type="wicri:Area/Pmc/Curation">000109</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Mining images in biomedical publications: Detection and analysis of gel
diagrams</title>
<author><name sortKey="Kuhn, Tobias" sort="Kuhn, Tobias" uniqKey="Kuhn T" first="Tobias" last="Kuhn">Tobias Kuhn</name>
<affiliation wicri:level="1"><nlm:aff id="I1">Department of Humanities, Social and Political Sciences, ETH Zurich, Zürich, Switzerland</nlm:aff>
<country xml:lang="fr">Suisse</country>
<wicri:regionArea>Department of Humanities, Social and Political Sciences, ETH Zurich, Zürich</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Nagy, Mate Levente" sort="Nagy, Mate Levente" uniqKey="Nagy M" first="Mate Levente" last="Nagy">Mate Levente Nagy</name>
<affiliation wicri:level="1"><nlm:aff id="I3">Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Luong, Thaibinh" sort="Luong, Thaibinh" uniqKey="Luong T" first="Thaibinh" last="Luong">Thaibinh Luong</name>
<affiliation wicri:level="1"><nlm:aff id="I2">Department of Pathology, Yale University School of Medicine, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Pathology, Yale University School of Medicine, New Haven, CT</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Krauthammer, Michael" sort="Krauthammer, Michael" uniqKey="Krauthammer M" first="Michael" last="Krauthammer">Michael Krauthammer</name>
<affiliation wicri:level="1"><nlm:aff id="I2">Department of Pathology, Yale University School of Medicine, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Pathology, Yale University School of Medicine, New Haven, CT</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><nlm:aff id="I3">Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series><title level="j">Journal of Biomedical Semantics</title>
<idno type="eISSN">2041-1480</idno>
<imprint><date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Authors of biomedical publications use gel images to report experimental results
such as protein-protein interactions or protein expressions under different
conditions. Gel images offer a concise way to communicate such findings, not all
of which need to be explicitly discussed in the article text. This fact together
with the abundance of gel images and their shared common patterns makes them
prime candidates for automated image mining and parsing. We introduce an
approach for the detection of gel images, and present a workflow to analyze
them. We are able to detect gel segments and panels at high accuracy, and
present preliminary results for the identification of gene names in these
images. While we cannot provide a complete solution at this point, we present
evidence that this kind of image mining is feasible.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Yu, H" uniqKey="Yu H">H Yu</name>
</author>
<author><name sortKey="Lee, M" uniqKey="Lee M">M Lee</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zweigenbaum, P" uniqKey="Zweigenbaum P">P Zweigenbaum</name>
</author>
<author><name sortKey="Demner Fushman, D" uniqKey="Demner Fushman D">D Demner-Fushman</name>
</author>
<author><name sortKey="Yu, H" uniqKey="Yu H">H Yu</name>
</author>
<author><name sortKey="Cohen, Kb" uniqKey="Cohen K">KB Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Peng, H" uniqKey="Peng H">H Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Xu, S" uniqKey="Xu S">S Xu</name>
</author>
<author><name sortKey="Mccusker, J" uniqKey="Mccusker J">J McCusker</name>
</author>
<author><name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hearst, Ma" uniqKey="Hearst M">MA Hearst</name>
</author>
<author><name sortKey="Divoli, A" uniqKey="Divoli A">A Divoli</name>
</author>
<author><name sortKey="Guturu, H" uniqKey="Guturu H">H Guturu</name>
</author>
<author><name sortKey="Ksikes, A" uniqKey="Ksikes A">A Ksikes</name>
</author>
<author><name sortKey="Nakov, P" uniqKey="Nakov P">P Nakov</name>
</author>
<author><name sortKey="Wooldridge, Ma" uniqKey="Wooldridge M">MA Wooldridge</name>
</author>
<author><name sortKey="Ye, J" uniqKey="Ye J">J Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kuhn, T" uniqKey="Kuhn T">T Kuhn</name>
</author>
<author><name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kuhn, T" uniqKey="Kuhn T">T Kuhn</name>
</author>
<author><name sortKey="Luong, T" uniqKey="Luong T">T Luong</name>
</author>
<author><name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Southern, E" uniqKey="Southern E">E Southern</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Alwine, Jc" uniqKey="Alwine J">JC Alwine</name>
</author>
<author><name sortKey="Kemp, Dj" uniqKey="Kemp D">DJ Kemp</name>
</author>
<author><name sortKey="Stark, Gr" uniqKey="Stark G">GR Stark</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Burnette, Wn" uniqKey="Burnette W">WN Burnette</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="De Bruijn, B" uniqKey="De Bruijn B">B De Bruijn</name>
</author>
<author><name sortKey="Martin, J" uniqKey="Martin J">J Martin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Murphy, Rf" uniqKey="Murphy R">RF Murphy</name>
</author>
<author><name sortKey="Kou, Z" uniqKey="Kou Z">Z Kou</name>
</author>
<author><name sortKey="Hua, J" uniqKey="Hua J">J Hua</name>
</author>
<author><name sortKey="Joffe, M" uniqKey="Joffe M">M Joffe</name>
</author>
<author><name sortKey="Cohen, Ww" uniqKey="Cohen W">WW Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Qian, Y" uniqKey="Qian Y">Y Qian</name>
</author>
<author><name sortKey="Murphy, Rf" uniqKey="Murphy R">RF Murphy</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kozhenkov, S" uniqKey="Kozhenkov S">S Kozhenkov</name>
</author>
<author><name sortKey="Baitaluk, M" uniqKey="Baitaluk M">M Baitaluk</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lemkin, Pf" uniqKey="Lemkin P">PF Lemkin</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Luhn, S" uniqKey="Luhn S">S Luhn</name>
</author>
<author><name sortKey="Berth, M" uniqKey="Berth M">M Berth</name>
</author>
<author><name sortKey="Hecker, M" uniqKey="Hecker M">M Hecker</name>
</author>
<author><name sortKey="Bernhardt, J" uniqKey="Bernhardt J">J Bernhardt</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cutler, P" uniqKey="Cutler P">P Cutler</name>
</author>
<author><name sortKey="Heald, G" uniqKey="Heald G">G Heald</name>
</author>
<author><name sortKey="White, Ir" uniqKey="White I">IR White</name>
</author>
<author><name sortKey="Ruan, J" uniqKey="Ruan J">J Ruan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rogers, M" uniqKey="Rogers M">M Rogers</name>
</author>
<author><name sortKey="Graham, J" uniqKey="Graham J">J Graham</name>
</author>
<author><name sortKey="Tonge, Rp" uniqKey="Tonge R">RP Tonge</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Zerr, T" uniqKey="Zerr T">T Zerr</name>
</author>
<author><name sortKey="Henikoff, S" uniqKey="Henikoff S">S Henikoff</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Schlamp, K" uniqKey="Schlamp K">K Schlamp</name>
</author>
<author><name sortKey="Weinmann, A" uniqKey="Weinmann A">A Weinmann</name>
</author>
<author><name sortKey="Krupp, M" uniqKey="Krupp M">M Krupp</name>
</author>
<author><name sortKey="Maass, T" uniqKey="Maass T">T Maass</name>
</author>
<author><name sortKey="Galle, P" uniqKey="Galle P">P Galle</name>
</author>
<author><name sortKey="Teufel, A" uniqKey="Teufel A">A Teufel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rodriguez Esteban, R" uniqKey="Rodriguez Esteban R">R Rodriguez-Esteban</name>
</author>
<author><name sortKey="Iossifov, I" uniqKey="Iossifov I">I Iossifov</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ramakrishnan, C" uniqKey="Ramakrishnan C">C Ramakrishnan</name>
</author>
<author><name sortKey="Patnia, A" uniqKey="Patnia A">A Patnia</name>
</author>
<author><name sortKey="Hovy, Eh" uniqKey="Hovy E">EH Hovy</name>
</author>
<author><name sortKey="Burns, Gapc" uniqKey="Burns G">GAPC Burns</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Xu, S" uniqKey="Xu S">S Xu</name>
</author>
<author><name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Xu, S" uniqKey="Xu S">S Xu</name>
</author>
<author><name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Haralick, Rm" uniqKey="Haralick R">RM Haralick</name>
</author>
<author><name sortKey="Shanmugam, K" uniqKey="Shanmugam K">K Shanmugam</name>
</author>
<author><name sortKey="Dinstein, I" uniqKey="Dinstein I">I Dinstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Cooper, Gf" uniqKey="Cooper G">GF Cooper</name>
</author>
<author><name sortKey="Herskovits, E" uniqKey="Herskovits E">E Herskovits</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
<author><name sortKey="Witten, Ih" uniqKey="Witten I">IH Witten</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lecun, Y" uniqKey="Lecun Y">Y LeCun</name>
</author>
<author><name sortKey="Bengio, Y" uniqKey="Bengio Y">Y Bengio</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bradley, Ap" uniqKey="Bradley A">AP Bradley</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tanabe, L" uniqKey="Tanabe L">L Tanabe</name>
</author>
<author><name sortKey="Wilbur, Wj" uniqKey="Wilbur W">WJ Wilbur</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lu, Z" uniqKey="Lu Z">Z Lu</name>
</author>
<author><name sortKey="Kao, Hy" uniqKey="Kao H">HY Kao</name>
</author>
<author><name sortKey="Wei, Ch" uniqKey="Wei C">CH Wei</name>
</author>
<author><name sortKey="Huang, M" uniqKey="Huang M">M Huang</name>
</author>
<author><name sortKey="Liu, J" uniqKey="Liu J">J Liu</name>
</author>
<author><name sortKey="Kuo, Cj" uniqKey="Kuo C">CJ Kuo</name>
</author>
<author><name sortKey="Hsu, Cn" uniqKey="Hsu C">CN Hsu</name>
</author>
<author><name sortKey="Tsai, R" uniqKey="Tsai R">R Tsai</name>
</author>
<author><name sortKey="Dai, Hj" uniqKey="Dai H">HJ Dai</name>
</author>
<author><name sortKey="Okazaki, N" uniqKey="Okazaki N">N Okazaki</name>
</author>
<author><name sortKey="Cho, Hc" uniqKey="Cho H">HC Cho</name>
</author>
<author><name sortKey="Gerner, M" uniqKey="Gerner M">M Gerner</name>
</author>
<author><name sortKey="Solt, I" uniqKey="Solt I">I Solt</name>
</author>
<author><name sortKey="Agarwal, S" uniqKey="Agarwal S">S Agarwal</name>
</author>
<author><name sortKey="Liu, F" uniqKey="Liu F">F Liu</name>
</author>
<author><name sortKey="Vishnyakova, D" uniqKey="Vishnyakova D">D Vishnyakova</name>
</author>
<author><name sortKey="Ruch, P" uniqKey="Ruch P">P Ruch</name>
</author>
<author><name sortKey="Romacker, M" uniqKey="Romacker M">M Romacker</name>
</author>
<author><name sortKey="Rinaldi, F" uniqKey="Rinaldi F">F Rinaldi</name>
</author>
<author><name sortKey="Bhattacharya, S" uniqKey="Bhattacharya S">S Bhattacharya</name>
</author>
<author><name sortKey="Srinivasan, P" uniqKey="Srinivasan P">P Srinivasan</name>
</author>
<author><name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
<author><name sortKey="Torii, M" uniqKey="Torii M">M Torii</name>
</author>
<author><name sortKey="Matos, S" uniqKey="Matos S">S Matos</name>
</author>
<author><name sortKey="Campos, D" uniqKey="Campos D">D Campos</name>
</author>
<author><name sortKey="Verspoor, K" uniqKey="Verspoor K">K Verspoor</name>
</author>
<author><name sortKey="Livingston, Km" uniqKey="Livingston K">KM Livingston</name>
</author>
<author><name sortKey="Wilbur, Wj" uniqKey="Wilbur W">WJ Wilbur</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Barbano, Pe" uniqKey="Barbano P">PE Barbano</name>
</author>
<author><name sortKey="Nagy, Ml" uniqKey="Nagy M">ML Nagy</name>
</author>
<author><name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sermanet, P" uniqKey="Sermanet P">P Sermanet</name>
</author>
<author><name sortKey="Kavukcuoglu, K" uniqKey="Kavukcuoglu K">K Kavukcuoglu</name>
</author>
<author><name sortKey="Lecun, Y" uniqKey="Lecun Y">Y LeCun</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Krizhevsky, A" uniqKey="Krizhevsky A">A Krizhevsky</name>
</author>
<author><name sortKey="Sutskever, I" uniqKey="Sutskever I">I Sutskever</name>
</author>
<author><name sortKey="Hinton, G" uniqKey="Hinton G">G Hinton</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ciccarese, P" uniqKey="Ciccarese P">P Ciccarese</name>
</author>
<author><name sortKey="Ocana, M" uniqKey="Ocana M">M Ocana</name>
</author>
<author><name sortKey="Garcia Castro, Lj" uniqKey="Garcia Castro L">LJ Garcia Castro</name>
</author>
<author><name sortKey="Das, S" uniqKey="Das S">S Das</name>
</author>
<author><name sortKey="Clark, T" uniqKey="Clark T">T Clark</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sanderson, R" uniqKey="Sanderson R">R Sanderson</name>
</author>
<author><name sortKey="Ciccarese, P" uniqKey="Ciccarese P">P Ciccarese</name>
</author>
<author><name sortKey="Van De Sompel, H" uniqKey="Van De Sompel H">H Van de Sompel</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en"><pmc-dir>properties open_access</pmc-dir>
<front><journal-meta><journal-id journal-id-type="nlm-ta">J Biomed Semantics</journal-id>
<journal-id journal-id-type="iso-abbrev">J Biomed Semantics</journal-id>
<journal-title-group><journal-title>Journal of Biomedical Semantics</journal-title>
</journal-title-group>
<issn pub-type="epub">2041-1480</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">24568573</article-id>
<article-id pub-id-type="pmc">4190668</article-id>
<article-id pub-id-type="publisher-id">2041-1480-5-10</article-id>
<article-id pub-id-type="doi">10.1186/2041-1480-5-10</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research</subject>
</subj-group>
</article-categories>
<title-group><article-title>Mining images in biomedical publications: Detection and analysis of gel
diagrams</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes" id="A1"><name><surname>Kuhn</surname>
<given-names>Tobias</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>kuhntobias@gmail.com</email>
</contrib>
<contrib contrib-type="author" id="A2"><name><surname>Nagy</surname>
<given-names>Mate Levente</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>mate.nagy@yale.edu</email>
</contrib>
<contrib contrib-type="author" id="A3"><name><surname>Luong</surname>
<given-names>ThaiBinh</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>thaibinh@gmail.com</email>
</contrib>
<contrib contrib-type="author" id="A4"><name><surname>Krauthammer</surname>
<given-names>Michael</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<xref ref-type="aff" rid="I3">3</xref>
<email>michael.krauthammer@yale.edu</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
Department of Humanities, Social and Political Sciences, ETH Zurich, Zürich, Switzerland</aff>
<aff id="I2"><label>2</label>
Department of Pathology, Yale University School of Medicine, New Haven, CT, USA</aff>
<aff id="I3"><label>3</label>
Program for Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA</aff>
<pub-date pub-type="collection"><year>2014</year>
</pub-date>
<pub-date pub-type="epub"><day>25</day>
<month>2</month>
<year>2014</year>
</pub-date>
<volume>5</volume>
<fpage>10</fpage>
<lpage>10</lpage>
<history><date date-type="received"><day>14</day>
<month>6</month>
<year>2013</year>
</date>
<date date-type="accepted"><day>5</day>
<month>2</month>
<year>2014</year>
</date>
</history>
<permissions><copyright-statement>Copyright © 2014 Kuhn et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2014</copyright-year>
<copyright-holder>Kuhn et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which
permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly credited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.jbiomedsem.com/content/5/1/10"></self-uri>
<abstract><p>Authors of biomedical publications use gel images to report experimental results
such as protein-protein interactions or protein expressions under different
conditions. Gel images offer a concise way to communicate such findings, not all
of which need to be explicitly discussed in the article text. This fact together
with the abundance of gel images and their shared common patterns makes them
prime candidates for automated image mining and parsing. We introduce an
approach for the detection of gel images, and present a workflow to analyze
them. We are able to detect gel segments and panels at high accuracy, and
present preliminary results for the identification of gene names in these
images. While we cannot provide a complete solution at this point, we present
evidence that this kind of image mining is feasible.</p>
</abstract>
</article-meta>
</front>
<body><sec sec-type="intro"><title>Introduction</title>
<p>A recent trend in the area of literature mining is the inclusion of images in the
form of figures from biomedical publications [<xref ref-type="bibr" rid="B1">1</xref>
-<xref ref-type="bibr" rid="B3">3</xref>
]. This development benefits from the fact that an increasing number of
scientific articles are published as open access publications. This means that not
just the abstracts but the complete texts including images are available for data
analysis. Among other things, this enabled the development of query engines for
biomedical images like the Yale Image Finder [<xref ref-type="bibr" rid="B4">4</xref>
] and the BioText Search Engine [<xref ref-type="bibr" rid="B5">5</xref>
]. Below, we present our approach to detect and access gel diagrams. This
is an extended version of a previous workshop paper [<xref ref-type="bibr" rid="B6">6</xref>
].</p>
<p>As a preparatory evaluation to decide which image type to focus on, we built a corpus
of 3,000 figures that allows us to reliably estimate the numbers and types of images
in biomedical articles. These figures were drawn randomly from the open access
subset of PubMed Central and then manually annotated. They were split into
subfigures when the figure consisted of several components. Figure <xref ref-type="fig" rid="F1">1</xref>
shows the resulting categories and subcategories. This classification
scheme is based on five basic image categories: Experimental/Microscopy, Graph,
Diagram, Clinical and Picture, each divided into multiple subcategories. It shows
that bar graphs (12.4%), black-on-white gels (12.0%), fluorescence microscopy images
(9.4%), and line graphs (8.1%) are the most frequent subfigure types (all
percentages are relative to the entire set of images).</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p>Categorization of images from open access articles of PubMed Central.</p>
</caption>
<graphic xlink:href="2041-1480-5-10-1"></graphic>
</fig>
<p>We targeted different kinds of graphs (i.e., diagrams with axes) in previous work [<xref ref-type="bibr" rid="B7">7</xref>
], and we decided to focus this work on the second most common type of
images: gel diagrams. They are the result of gel electrophoresis, which is a common
method to analyze DNA, RNA and proteins. Southern, Western and Northern blotting [<xref ref-type="bibr" rid="B8">8</xref>
-<xref ref-type="bibr" rid="B10">10</xref>
] are among the most common applications of gel electrophoresis. The
resulting experimental artifacts are often shown in biomedical publications in the
form of gel images as evidence for the discussed findings such as protein-protein
interactions or protein expressions under different conditions. Often, not all
details of the results shown in these images are explicitly stated in the caption or
the article text. For these reasons, it would be of high value to be able to
reliably mine the relations encoded in these images.</p>
<p>A closer look at gel images reveals that they follow regular patterns to encode their
semantic relations. Figure <xref ref-type="fig" rid="F2">2</xref>
shows two typical examples of gel
images together with a table representation of the involved relations. The ultimate
objective of our approach (for which we can only present a partial solution here) is
to automatically extract at least some of these relations from the respective
images, possibly in conjunction with classical text mining techniques. The first
example shows a Western blot for detecting two proteins (14-3-3 <italic>σ</italic>
and
<italic>β</italic>
-actin as a control) in four different cell lines (MDA-MB-231,
NHEM, C8161.9, and LOX, the first of which is used as a control). There are two
rectangular gel segments arranged in a way to form a 2×4 grid for the
individual eight measurements combining each protein with each cell line. A gel
diagram can be considered a kind of matrix with pictures of experimental artifacts
as content. The tables to the right illustrate the semantic relations encoded in the
gel diagrams. Each relation instance consists of a condition, a measurement and a
result. The proteins are the entities being measured under the conditions of the
different cell lines. The result is a certain degree of expression indicated by the
darkness of the spots (or brightness in the case of white-on-black gels). The second
example is a slightly more complex one. Several proteins are tested against each
other in a way that involves more than two dimensions. In this case, the use of
“+” and “–” labels is a frequent technique to denote
the different possible combinations of a number of conditions. Apart from that, the
principles are the same. In this case, however, the number of relations is much
larger. Only the first eight of a total of 32 relation instances are shown in the
table to the right. In such cases, the text rarely mentions all these relations in
an explicit way, and the image is therefore the only accessible source.</p>
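The condition/measurement/result tables described above can be sketched as a small data structure. This is a minimal illustration only: the field names and the "expressed" result values are our own assumptions, not the authors' schema.

```python
# Illustrative sketch of the relation tables described above: each
# relation pairs a condition (cell line) with a measurement (protein)
# and a result. Field names and result values are assumptions for
# illustration, not the authors' actual schema.

def gel_relations(conditions, measurements, results):
    """Cross each measurement with each condition, as in the 2x4 grid
    of the first example (2 proteins x 4 cell lines = 8 relations)."""
    return [{"condition": c, "measurement": m, "result": results[(m, c)]}
            for m in measurements
            for c in conditions]

cell_lines = ["MDA-MB-231", "NHEM", "C8161.9", "LOX"]
proteins = ["14-3-3 sigma", "beta-actin"]
# hypothetical darkness readings keyed by (protein, cell line)
readings = {(p, c): "expressed" for p in proteins for c in cell_lines}
relations = gel_relations(cell_lines, proteins, readings)
```

With the "+"/"–" labeling of the second example, the conditions would instead enumerate condition combinations, yielding the 32 rows mentioned in the text.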
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p>Two examples of gel images from biomedical publications (PMID 19473536
and 15125785) with tables showing the relations that could be extracted
from them.</p>
</caption>
<graphic xlink:href="2041-1480-5-10-2"></graphic>
</fig>
</sec>
<sec><title>Background</title>
<p>In principle, image mining involves the same processes as classical literature mining [<xref ref-type="bibr" rid="B11">11</xref>
]: document categorization, named entity tagging, fact extraction, and
collection-wide analysis. However, there are some subtle differences. Document
categorization corresponds to image categorization, which is different in the sense
that it has to deal with features based on the two-dimensional space of pixels, but
otherwise the same principles of automatic categorization apply. Named entity
tagging is different in two ways: pinpointing the mention of an entity is more
difficult with images (a large number of pixels versus a couple of characters), and
OCR errors have to be considered. Fact extraction in classical literature mining
involves the analysis of the syntactic structure of sentences. In images, by
contrast, there are rarely complete sentences; the semantics is instead encoded
by graphical means. Thus, instead of parsing sentences, one has to analyze graphical
elements and their relation to each other. The last process, collection-wide
analysis, is a higher-level problem, and therefore no fundamental differences can be
expected. Thus, image mining builds upon the same general stages as classical text
mining, but with some subtle yet important differences.</p>
<p>Image mining on biomedical publications is not a new idea. It has been applied for
the extraction of subcellular location information [<xref ref-type="bibr" rid="B12">12</xref>
], the detection of panels of fluorescence microscopy images [<xref ref-type="bibr" rid="B13">13</xref>
], the extraction of pathway information from diagrams [<xref ref-type="bibr" rid="B14">14</xref>
], and the detection of axis diagrams [<xref ref-type="bibr" rid="B7">7</xref>
]. Also, there is a large amount of existing work on how to process gel
images [<xref ref-type="bibr" rid="B15">15</xref>
-<xref ref-type="bibr" rid="B19">19</xref>
] and databases have been proposed to store the results of gel analyses [<xref ref-type="bibr" rid="B20">20</xref>
]. These techniques, however, take as input plain gel images, which are not
readily accessible from biomedical papers, because they make up just parts of the
figures. Furthermore, these tools are designed for researchers who want to analyze
their own gel images, not for reading gel diagrams that have already been analyzed and
annotated by a researcher. Therefore, these approaches do not tackle the problem of
recognizing and analyzing the labels of gel images. Some attempts to classify
biomedical images include gel figures [<xref ref-type="bibr" rid="B21">21</xref>
], which is, however, just the first step in locating them and analyzing
their labels and their structure. To our knowledge, nobody has yet tried to perform
image mining on gel diagrams.</p>
</sec>
<sec><title>Approach and methods</title>
<p>Figure <xref ref-type="fig" rid="F3">3</xref>
shows the procedure of our approach to image mining
from gel diagrams. It consists of seven steps: figure extraction, segmentation, text
recognition, gel detection, gel panel detection, named entity recognition and
relation extraction.<sup>a</sup>
</p>
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p>The procedure of our approach: (1) figure extraction, (2) segmentation,
(3) text recognition, (4) gel detection, (5) gel panel detection, (6)
named entity recognition, and (7) relation extraction.</p>
</caption>
<graphic xlink:href="2041-1480-5-10-3"></graphic>
</fig>
<p>Using structured article representations, the first step is trivial. For steps two
and three, we rely on existing work. The main focus of this paper lies on steps four
and five: the detection of gels and gel panels. In the discussion section, we
present some preliminary results on step six of recognizing named entities, and
sketch how step seven could be implemented, for which we cannot provide a concrete
solution at this point.</p>
<p>To practically evaluate our approach, we ran our pipeline on the entire open access
subset of PubMed Central (though not all figures made it through the whole pipeline
due to technical difficulties).</p>
<sec><title>Figure extraction</title>
<p>A large portion of the articles of the open access subset of the PubMed Central
database are available as structured XML files with additional image files for
the figures. So far we use only these articles, which makes the figure
extraction task straightforward. It would be more difficult, though definitely
feasible, to extract the figures from PDF files or even bitmaps of scanned
articles (see [<xref ref-type="bibr" rid="B22">22</xref>
] and <ext-link ext-link-type="uri" xlink:href="http://pdfjailbreak.com">http://pdfjailbreak.com</ext-link>
for approaches on extracting
the structure of articles in PDF format).</p>
</sec>
<sec><title>Segmentation and text recognition</title>
<p>For the next two steps — segment detection and subsequent text recognition
— we rely on our previous work [<xref ref-type="bibr" rid="B23">23</xref>
,<xref ref-type="bibr" rid="B24">24</xref>
]. This method includes the detection of layout elements, edge
detection, and text recognition with a novel pivoting approach. For optical
character recognition (OCR), the Microsoft Document Imaging package is used,
which is available as part of Microsoft Office 2003. Overall, this approach has
been shown to perform better than other existing approaches for the images found
in biomedical publications [<xref ref-type="bibr" rid="B23">23</xref>
]. We do not go into the details here, as this paper focuses on the
subsequent steps.</p>
<p>Because the segmentation algorithm has limitations when it comes to rectangles
with low internal contrast (such as gels), we applied a complementary, very
simple rectangle detection algorithm.</p>
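<p>For illustration, such a complementary detector can be sketched along the
following lines. This is a hedged Python sketch, not the algorithm actually
used: it grows axis-aligned boxes greedily from darker-than-background seed
pixels, and the threshold and minimum size are invented for the example.</p>

```python
def find_rectangles(gray, dark_thresh=200, min_size=20):
    """Very simple rectangle detection sketch: flag maximal axis-aligned
    blocks of darker-than-background pixels and keep box-shaped ones.
    `gray` is a list of rows of 0-255 intensities; names are illustrative."""
    h, w = len(gray), len(gray[0])
    mask = [[px < dark_thresh for px in row] for row in gray]
    visited = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not visited[y][x]:
                # grow a bounding box greedily from this seed pixel
                x2 = x
                while x2 + 1 < w and mask[y][x2 + 1]:
                    x2 += 1
                y2 = y
                while y2 + 1 < h and all(mask[y2 + 1][i] for i in range(x, x2 + 1)):
                    y2 += 1
                for yy in range(y, y2 + 1):
                    for xx in range(x, x2 + 1):
                        visited[yy][xx] = True
                if x2 - x + 1 >= min_size and y2 - y + 1 >= min_size:
                    boxes.append((x, y, x2, y2))
    return boxes
```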
</sec>
<sec><title>Gel segment detection</title>
<p>Based on the results of the above-mentioned steps, we try to identify gel
segments. Such gel segments typically have rectangular shapes with darker spots
on a light gray background, or — less commonly — white spots on a
dark background. We decided to use machine learning techniques to generate
classifiers to detect such gel segments. To do so, we defined 39 numerical
features for image segments: the coordinates of the relative position (within
the image), the relative and absolute width and height, 16 grayscale histogram
features, three color features (for red, green and blue), 13 texture features
(coarseness, presence of ripples, etc.) based on [<xref ref-type="bibr" rid="B25">25</xref>
], and the number of recognized characters.</p>
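<p>The feature extraction can be illustrated with a small Python sketch. It
covers only a subset of the 39 features (relative position and size, the 16
grayscale histogram bins, three color means, and the character count); the
exact computations and names are assumptions for illustration, not the
implementation used.</p>

```python
def segment_features(seg_pixels, seg_box, img_size, n_chars):
    """Sketch of a feature vector in the spirit of the features described
    above; `seg_pixels` is a list of (r, g, b) tuples for the segment,
    `seg_box` its bounding box, `img_size` the figure dimensions."""
    (x, y, x2, y2), (w, h) = seg_box, img_size
    feats = [x / w, y / h,                      # relative position
             (x2 - x) / w, (y2 - y) / h,        # relative width and height
             float(x2 - x), float(y2 - y)]      # absolute width and height
    hist = [0] * 16                             # 16 grayscale histogram bins
    for r, g, b in seg_pixels:
        gray = (r + g + b) // 3
        hist[min(gray // 16, 15)] += 1
    feats += [c / len(seg_pixels) for c in hist]
    for i in range(3):                          # mean red, green, blue
        feats.append(sum(p[i] for p in seg_pixels) / len(seg_pixels))
    feats.append(float(n_chars))                # recognized characters
    return feats
```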
<p>To train the classifiers, we took a random sample of 500 figures, for which we
manually annotated the gel segments. In the same way, we obtained a second
sample of another 500 figures for testing the classifiers.<sup>b</sup>
We used
the Weka toolkit and opted for random forest classifiers based on 75 random
trees. Using different thresholds to adjust the trade-off between precision and
recall, we generated a classifier with good precision and another one with good
recall. Both of them are used in the next step. We tried other types of
classifiers including naive Bayes, Bayesian networks [<xref ref-type="bibr" rid="B26">26</xref>
], PART decision lists [<xref ref-type="bibr" rid="B27">27</xref>
], and convolutional networks [<xref ref-type="bibr" rid="B28">28</xref>
], but we achieved the best results with random forests.</p>
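<p>The way a single probabilistic classifier yields both a high-precision and
a high-recall variant can be sketched as follows: hard labels are derived from
the predicted class probabilities at different thresholds, and precision and
recall are measured for each. This is a generic illustration, not the Weka
setup actually used.</p>

```python
def prf_at_threshold(probs, labels, threshold):
    """Turn classifier probabilities into hard labels at a given threshold
    and report precision, recall, and F-score, mirroring how a high-precision
    and a high-recall classifier can be derived from one trained model.
    `probs` are P(gel) scores from any probabilistic classifier (sketch)."""
    pred = [p >= threshold for p in probs]
    tp = sum(1 for p, l in zip(pred, labels) if p and l)
    fp = sum(1 for p, l in zip(pred, labels) if p and not l)
    fn = sum(1 for p, l in zip(pred, labels) if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

<p>Raising the threshold trades recall for precision, which is how the two
classifiers used in the panel detection step differ.</p>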
</sec>
<sec><title>Gel panel detection</title>
<p>A gel panel typically consists of several gel segments and comes with labels
describing the involved genes, proteins, and conditions. For our goal, it is not
sufficient to just detect the figures that contain gel panels, but we also have
to extract their positions within the figures and to access their labels. This
is not a simple classification task, and therefore machine learning techniques
do not apply as readily. We therefore used a detection procedure based
on hand-coded rules.</p>
<p>In a first step, we group gel segments to find contiguous gel regions that form
the center parts of gel panels. To do so, we start by looking for segments that
our high-precision classifier detects as gel segments. Then, we repeatedly look
for adjacent gel segments, this time applying the high-recall classifier, and
merge them. Two segments are considered neighbors if they are at most 50 pixels
apart<sup>c</sup>
and do not have any text segment between them. Thus,
segments that could be gel segments according to the high-recall classifier
make it into a gel panel only if their group contains at least one
high-precision segment. The goal is to detect panels with high precision, but
also to detect complete panels rather than just parts of them. We focus on
precision because the large number of available gel images compensates for low
recall. Furthermore, as the open access part of PubMed Central makes up only a
small subset of all biomedical publications, recall in a more general sense is
limited by the proportion of open access publications anyway.</p>
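<p>The grouping procedure can be sketched as follows (illustrative Python; the
adjacency function and the classifier predicates are stand-ins for the
50-pixel rule and the two trained classifiers):</p>

```python
def group_gel_segments(segments, is_hp, is_hr, neighbors):
    """Sketch of the grouping step described above: seed groups with segments
    the high-precision classifier accepts, then transitively merge adjacent
    candidates accepted by the high-recall classifier. Adjacency (at most
    50 px apart with no text segment in between) is assumed precomputed in
    `neighbors`; all names here are illustrative."""
    groups, claimed = [], set()
    for seed in segments:
        if seed in claimed or not is_hp(seed):
            continue
        group, frontier = {seed}, [seed]
        while frontier:                       # expand the region transitively
            for n in neighbors(frontier.pop()):
                if n not in group and is_hr(n):
                    group.add(n)
                    frontier.append(n)
        claimed |= group
        groups.append(group)
    return groups
```

<p>A group thus contains at least one high-precision seed, and high-recall
candidates join only by adjacency to an existing group member.</p>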
<p>As a next step, we collect the labels in the form of text segments located around
the detected gel regions. For a text segment to be attributed to a certain gel
panel, its nearest edge must be at most 30 pixels away from the border of the
gel region and its farthest edge must not be more than 150 pixels away. We end
up with a representation of a gel panel consisting of two parts: a center region
containing a number of gel segments and a set of labels in the form of text
segments located around the center region.</p>
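<p>The attribution rule can be sketched for axis-aligned bounding boxes as
follows. The per-axis distance computations are one plausible reading of
"pixels away"; the actual implementation may measure distances differently.</p>

```python
def attribute_label(gel, text, near_max=30, far_max=150):
    """Sketch of the label-attribution rule: a text segment belongs to a gel
    panel if its nearest edge is within `near_max` px of the gel region
    border and its farthest edge within `far_max` px. Boxes are
    (x1, y1, x2, y2); the per-axis maximum is an illustrative choice."""
    def axis_gap(a1, a2, b1, b2):    # gap between two intervals, 0 if overlap
        return max(0, max(b1 - a2, a1 - b2))
    def axis_reach(a1, a2, b1, b2):  # how far interval b extends beyond a
        return max(0, b2 - a2, a1 - b1)
    gx1, gy1, gx2, gy2 = gel
    tx1, ty1, tx2, ty2 = text
    near = max(axis_gap(gx1, gx2, tx1, tx2), axis_gap(gy1, gy2, ty1, ty2))
    far = max(axis_reach(gx1, gx2, tx1, tx2), axis_reach(gy1, gy2, ty1, ty2))
    return near <= near_max and far <= far_max
```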
<p>To evaluate this algorithm, we collected yet another sample of 500 figures, in
which 106 gel panels in 61 different figures were revealed by manual
annotation.<sup>d</sup>
Based on this sample, we manually checked whether
our algorithm is able to detect the presence and the (approximate) position of
the gel panels.</p>
</sec>
</sec>
<sec sec-type="results"><title>Results</title>
<p>The top part of Table <xref ref-type="table" rid="T1">1</xref>
shows the result of the gel detection
classifier. We generated three different classifiers from the training data, one for
each of the threshold values 0.15, 0.3 and 0.6. Lower threshold values lead to
higher recall at the cost of precision, and vice versa. In the balanced case, we
achieved an F-score of 75%. For classifiers with precision or recall over 90%,
the F-score drops significantly but stays in a sensible range. These two
classifiers (thresholds 0.15 and 0.6) are used in the next step. To interpret these
values, one has to consider that gel segments are greatly outnumbered by non-gel
segments. Concretely, only about 3% are gel segments. More sophisticated accuracy
measures for classifier performance, such as the area under the ROC curve [<xref ref-type="bibr" rid="B29">29</xref>
], take this into account. For the presented classifiers, the area under
the ROC curve is 98.0% (on a scale from 50% for a trivial, worthless classifier to
100% for a perfect one).</p>
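<p>The effect of this imbalance on precision can be made concrete with the
standard identity relating precision to prevalence, true positive rate, and
false positive rate (the rates in the usage note are illustrative, not
measured values from our classifiers):</p>

```python
def precision_from_rates(prevalence, tpr, fpr):
    """How class imbalance depresses precision: with only ~3% gel segments,
    even a small false-positive rate produces many false positives relative
    to the true positives. Standard identity; inputs are illustrative."""
    tp = prevalence * tpr          # expected true positives per segment
    fp = (1 - prevalence) * fpr    # expected false positives per segment
    return tp / (tp + fp)
```

<p>For example, at 3% prevalence a classifier with a true positive rate of 0.9
and a false positive rate of only 0.03 already drops to a precision of about
0.48, while the same rates on balanced classes would give a precision above
0.96.</p>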
<table-wrap position="float" id="T1"><label>Table 1</label>
<caption><p>The results of the gel segment detection classifiers (top) and the gel
panel detection algorithm (bottom)</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="left"></col>
<col align="right"></col>
<col align="right"></col>
<col align="right"></col>
<col align="right"></col>
<col align="right"></col>
</colgroup>
<thead valign="top"><tr><th align="left"> </th>
<th align="left"><bold>Method</bold>
</th>
<th align="right"><bold>Threshold</bold>
</th>
<th align="right"><bold>Precision</bold>
</th>
<th align="right"><bold>Recall</bold>
</th>
<th align="right"><bold>F-score</bold>
</th>
<th align="right"><bold>ROC area</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom"> <hr></hr>
</td>
<td align="right" valign="bottom">0.15<hr></hr>
</td>
<td align="right" valign="bottom">0.439<hr></hr>
</td>
<td align="right" valign="bottom">0.909<hr></hr>
</td>
<td align="right" valign="bottom">0.592<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom">Random forests<hr></hr>
</td>
<td align="right" valign="bottom">0.30<hr></hr>
</td>
<td align="right" valign="bottom">0.765<hr></hr>
</td>
<td align="right" valign="bottom">0.739<hr></hr>
</td>
<td align="right" valign="bottom">0.752<hr></hr>
</td>
<td align="right" valign="bottom"><disp-formula><graphic xlink:href="2041-1480-5-10-i1.gif"></graphic>
</disp-formula>
0.980
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom"> <hr></hr>
</td>
<td align="right" valign="bottom">0.60<hr></hr>
</td>
<td align="right" valign="bottom">0.926<hr></hr>
</td>
<td align="right" valign="bottom">0.301<hr></hr>
</td>
<td align="right" valign="bottom">0.455<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Segments<hr></hr>
</td>
<td align="left" valign="bottom">Naive Bayes<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
<td align="right" valign="bottom">0.172<hr></hr>
</td>
<td align="right" valign="bottom">0.739<hr></hr>
</td>
<td align="right" valign="bottom">0.279<hr></hr>
</td>
<td align="right" valign="bottom">0.883<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom">Bayesian network<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
<td align="right" valign="bottom">0.394<hr></hr>
</td>
<td align="right" valign="bottom">0.531<hr></hr>
</td>
<td align="right" valign="bottom">0.452<hr></hr>
</td>
<td align="right" valign="bottom">0.914<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom">PART decision list<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
<td align="right" valign="bottom">0.631<hr></hr>
</td>
<td align="right" valign="bottom">0.496<hr></hr>
</td>
<td align="right" valign="bottom">0.555<hr></hr>
</td>
<td align="right" valign="bottom">0.777<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"> <hr></hr>
</td>
<td align="left" valign="bottom">Convolutional networks<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
<td align="right" valign="bottom">0.142<hr></hr>
</td>
<td align="right" valign="bottom">0.949<hr></hr>
</td>
<td align="right" valign="bottom">0.248<hr></hr>
</td>
<td align="right" valign="bottom"> <hr></hr>
</td>
</tr>
<tr><td align="left">Panels</td>
<td align="left">Hand-coded rules</td>
<td align="right"> </td>
<td align="right">0.951</td>
<td align="right">0.368</td>
<td align="right">0.530</td>
<td align="right"> </td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results of the gel panel detection algorithm are shown in the bottom part of
Table <xref ref-type="table" rid="T1">1</xref>
. The precision is 95% at a recall of 37%, leading to
an F-score of 53%. The comparatively low recall is mainly due to a general problem
of pipeline-based approaches: errors from the earlier steps accumulate and are
hard to correct at a later stage in the pipeline.</p>
<p>Table <xref ref-type="table" rid="T2">2</xref>
shows the results of running the pipeline on PubMed
Central. We started with about 410 000 articles, the entire open access subset of
PubMed Central at the time we downloaded them (February 2012). We successfully
parsed the XML files of 94% of these articles (for the remaining articles, the XML
file was missing or not well-formed, or other unexpected errors occurred). The
successful articles contained around 1 100 000 figures, for some of which our
segment detection step encountered image formatting errors or other internal errors,
or was just not able to detect any segments. We ended up with more than 880 000
figures, in which we detected about 86 000 gel panels, i.e. roughly one gel panel
per ten figures. For each panel, we found on average 3.6 labels with recognized text.
After tokenization, we identified about 76 000 gene names in these gel labels, which
corresponds to 6.8% of the tokens. Considering all text segments (including but not
restricted to gel labels), only 3.3% of the tokens are detected as gene
names.<sup>e</sup>
</p>
<table-wrap position="float" id="T2"><label>Table 2</label>
<caption><p>The results of running the pipeline on the open access subset of PubMed
Central</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="right"></col>
</colgroup>
<thead valign="top"><tr><th align="left" valign="bottom">Total articles<hr></hr>
</th>
<th align="right" valign="bottom">410 950<hr></hr>
</th>
</tr>
<tr><th align="left">Processed articles</th>
<th align="right">386 428</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom">Total figures from processed articles<hr></hr>
</td>
<td align="right" valign="bottom">1 110 643<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Processed figures<hr></hr>
</td>
<td align="right" valign="bottom">884 152<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Detected gel panels<hr></hr>
</td>
<td align="right" valign="bottom">85 942<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Detected gel panels per figure<hr></hr>
</td>
<td align="right" valign="bottom">0.097<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Detected gel labels<hr></hr>
</td>
<td align="right" valign="bottom">309 340<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Detected gel labels per panel<hr></hr>
</td>
<td align="right" valign="bottom">3.599<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Detected gene tokens<hr></hr>
</td>
<td align="right" valign="bottom">1 854 609<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Detected gene tokens in gel labels<hr></hr>
</td>
<td align="right" valign="bottom">75 610<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">Gene token ratio<hr></hr>
</td>
<td align="right" valign="bottom">0.033<hr></hr>
</td>
</tr>
<tr><td align="left">Gene token ratio in gel labels</td>
<td align="right">0.068</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec sec-type="discussion"><title>Discussion</title>
<p>The presented results show that we are able to detect gel segments with high
accuracy, which allows us to subsequently detect whole gel panels at a high
precision. The recall of the panel detection step is relatively low, but with about
37% still in a reasonable range. As mentioned above, we can leverage the high number
of available figures, which makes precision more important than recall. Running our
pipeline on the whole set of open access articles from PubMed Central, we were able
to retrieve 85 942 potential gel panels (around 95% of which we can expect to be
correctly detected).</p>
<p>The next step would be to recognize the named entities mentioned in the gel labels.
To this aim, we did a preliminary study to investigate whether we are able to
extract the names of genes and proteins from gel diagrams. To do so, we tokenized
the label texts and looked for entries in the Entrez Gene database to match the
tokens. This look-up was done in a case-sensitive way, because many names in gel
labels are acronyms, where the specific capitalization pattern can be critical to
identify the respective entity. We excluded tokens that have fewer than three
characters, are numbers (Arabic numerals or Roman numerals), or correspond to common short words
(retrieved from a list of the 100 most frequent words in biomedical articles). In
addition, we extended this exclusion list with 22 general words that are frequently
used in the context of gel diagrams, some of which coincide with gene names
according to Entrez.<sup>f</sup>
Since gel electrophoresis is a method to analyze
genes and proteins, we would expect to find more such mentions in gel labels than in
other text segments of a figure. By measuring this, we get an idea of whether the
approach works out or not. In addition, we manually checked the gene and protein
names extracted from gel labels after running our pipeline on 2 000 random figures.
In 124 of these figures, at least one gel panel was detected. Table <xref ref-type="table" rid="T3">3</xref>
shows the results of this preliminary evaluation. Almost two-thirds of
the detected gene/protein tokens (65.3%) were correctly identified. Of the
total, 9.0% were correct but could have been more specific, e.g. when only
“actin” was recognized for “<italic>β</italic>
-actin” (not incorrect, but much
harder to map to a meaningful identifier). The incorrect cases (34.6%)
can be split into two classes of roughly the same size: some recognized tokens were
actually not mentioned in the figure but emerged from OCR errors; other tokens were
correctly recognized but incorrectly classified as gene or protein references.</p>
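<p>The token filtering and the case-sensitive look-up can be sketched as
follows. The stop-word excerpt, the Roman-numeral test, and the plain-set
lexicon are simplifications standing in for the full exclusion lists and the
Entrez Gene database.</p>

```python
# Excerpt of the exclusion list of frequent words (the full list comprises
# 22 such words plus the 100 most frequent words in biomedical articles).
STOP_WORDS = {"min", "hrs", "gel", "blot", "protein", "kinase"}
ROMAN = set("IVXLCDM")  # crude Roman-numeral test, illustrative only

def candidate_gene_tokens(tokens, gene_lexicon):
    """Sketch of the gene-name look-up: drop tokens shorter than three
    characters, numbers (Arabic or Roman numerals), and common words, then
    match the rest case-sensitively against a gene-name lexicon (a plain
    set standing in for the Entrez Gene database)."""
    hits = []
    for t in tokens:
        if len(t) < 3 or t.isdigit() or set(t) <= ROMAN:
            continue
        if t.lower() in STOP_WORDS:
            continue
        if t in gene_lexicon:            # case-sensitive match
            hits.append(t)
    return hits
```

<p>The case-sensitive match is what keeps, for example, a lower-case OCR
misreading of an acronym from being counted as a gene mention.</p>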
<table-wrap position="float" id="T3"><label>Table 3</label>
<caption><p>Numbers of recognized gene/protein tokens in 2 000 random figures</p>
</caption>
<table frame="hsides" rules="groups" border="1"><colgroup><col align="left"></col>
<col align="right"></col>
<col align="right"></col>
</colgroup>
<thead valign="top"><tr><th align="left"> </th>
<th align="right"><bold>Absolute</bold>
</th>
<th align="right"><bold>Relative</bold>
</th>
</tr>
</thead>
<tbody valign="top"><tr><td align="left" valign="bottom"><bold>Total</bold>
<hr></hr>
</td>
<td align="right" valign="bottom"><bold>156</bold>
<hr></hr>
</td>
<td align="right" valign="bottom"><bold>100.0%</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"><bold>Incorrect</bold>
<hr></hr>
</td>
<td align="right" valign="bottom"><bold>54</bold>
<hr></hr>
</td>
<td align="right" valign="bottom"><bold>34.6%</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">– Not mentioned (OCR errors)<hr></hr>
</td>
<td align="right" valign="bottom">28<hr></hr>
</td>
<td align="right" valign="bottom">17.9%<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">– Not references to genes or proteins<hr></hr>
</td>
<td align="right" valign="bottom">26<hr></hr>
</td>
<td align="right" valign="bottom">16.7%<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom"><bold>Correct</bold>
<hr></hr>
</td>
<td align="right" valign="bottom"><bold>102</bold>
<hr></hr>
</td>
<td align="right" valign="bottom"><bold>65.3%</bold>
<hr></hr>
</td>
</tr>
<tr><td align="left" valign="bottom">– Partially correct (could be more specific)<hr></hr>
</td>
<td align="right" valign="bottom">14<hr></hr>
</td>
<td align="right" valign="bottom">9.0%<hr></hr>
</td>
</tr>
<tr><td align="left">– Fully correct</td>
<td align="right">88</td>
<td align="right">56.4%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Although there is certainly much room for improvement, this simple gene detection
step seems to perform reasonably well.</p>
<p>For the last step, relation extraction, we cannot present any concrete results at
this point. After recognizing the named entities, we would have to disambiguate
them, identify their semantic roles (condition, measurement or something else),
align the gel images with the labels, and ultimately quantify the degree of
expression. To improve the quality of the results, combinations with classical text
mining techniques should be considered. This is all future work. We expect to be
able to profit to a large extent from existing work to disambiguate protein and gene
names [<xref ref-type="bibr" rid="B30">30</xref>
,<xref ref-type="bibr" rid="B31">31</xref>
] and to detect and analyze gel spots (see the existing work mentioned
above).</p>
<p>It seems reasonable to assume that these results can be combined with existing
techniques of term disambiguation and gel spot detection at a satisfactory level of
accuracy. We plan to investigate this in future work.</p>
<p>As mentioned above, we have started to investigate how the gel segment detection step
could be improved by the use of the image recognition technique of convolutional
networks (ConvNet) [<xref ref-type="bibr" rid="B28">28</xref>
]. We started with a simplified version of the approach presented in [<xref ref-type="bibr" rid="B32">32</xref>
]. In this approach, images are tiled into small quadratic pieces. We used
a single network (and not several parallel networks), based on 48×48 input tile
images with three layers of convolutions. The first layer takes eight 5×5
convolutions and is followed by a 2×2 sub-sampling. The second layer takes
twenty-four 5×5 convolutions and is followed by a 3×3 sub-sampling. The
last layer takes seventy-two 6×6 convolutions, which leads to a fully
connected layer. We trained our ConvNet on the 500 images of the training set, where
we manually annotated the tiles as <italic>gel</italic>
and <italic>non-gel</italic>
. With the use
of EBLearn [<xref ref-type="bibr" rid="B33">33</xref>
], this trained ConvNet classified the tiles of the 500 images of our
testing set. The classified tiles can then be reconstructed into a mask image, as
shown in Figure <xref ref-type="fig" rid="F4">4</xref>
. A manual check of the clusters of
recognized gel tiles led to the results shown in Table <xref ref-type="table" rid="T1">1</xref>
.
Recall is very good (95%) but precision is very poor (14%), leading to an F-score of
25%. This is much worse than the results we got with our random forest approach,
which is why ConvNet is currently not part of our pipeline. We hope, however, that
we can further optimize this ConvNet approach and combine it with random forests to
exploit their (hopefully) complementary benefits. Using ConvNet to classify complete
images as <italic>gel-image</italic>
or <italic>non-gel-image</italic>
and adjusting the
classification to account for unbalanced classes, we were able to obtain an F-score
of 74%, which makes us confident that a combination of the two approaches could lead
to a significant improvement of our gel segment detection step. As an alternative
approach, we will try to run ConvNet on down-scaled entire panels rather than small
tiles, as described in [<xref ref-type="bibr" rid="B34">34</xref>
]. Furthermore, we will experiment with parallel networks instead of single
ones to improve accuracy.</p>
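<p>The spatial dimensions of the tile network described above can be traced
with a few lines of Python, assuming valid convolutions and non-overlapping
sub-sampling (a sketch of the size arithmetic, not of the network itself):</p>

```python
def conv_output_sizes(input_size, layers):
    """Trace the spatial size of a square input through a stack of valid
    convolutions and non-overlapping sub-sampling steps."""
    sizes = [input_size]
    for kind, k in layers:
        if kind == "conv":
            sizes.append(sizes[-1] - k + 1)   # valid k x k convolution
        else:                                  # "pool": k x k sub-sampling
            sizes.append(sizes[-1] // k)
    return sizes

# The tile ConvNet described above: 5x5 conv, 2x2 sub-sampling,
# 5x5 conv, 3x3 sub-sampling, 6x6 conv.
TILE_NET = [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 3), ("conv", 6)]
```

<p>Applied to the 48×48 input tiles, the sizes are 48, 44, 22, 18, 6, and
finally 1, i.e. the last convolution collapses each tile to a single unit
feeding the fully connected layer.</p>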
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p><bold>Original and mask image after ConvNet classification for an exemplary
image from PMID 14993249.</bold>
Green means <italic>gel</italic>
; brown means
<italic>other</italic>
; and white means <italic>not enough gradient
information</italic>
.</p>
</caption>
<graphic xlink:href="2041-1480-5-10-4"></graphic>
</fig>
<p>The results obtained from our gel recognition pipeline indicate that it is feasible
to extract relations from gel images, but it is clear that this procedure is far
from perfect. The automatic analysis of bitmap images seems to be the only efficient
way to extract such relations from existing publications, but other publishing
techniques should be considered for the future. The use of vector graphics instead
of bitmaps would already greatly improve any subsequent attempts of automatic
analysis. A further improvement would be to establish accepted standards for
different types of biomedical diagrams in the spirit of the Unified Modeling
Language, a graphical language widely applied in software engineering since the
1990s. Ideally, the resulting images could directly include semantic relations in a
formal notation, which would make relation mining a trivial procedure. If authors
are supported by good tools to draw diagrams like gel images, this approach could
turn out to be feasible even in the near future.</p>
<p>Concretely, we would like to take the opportunity to postulate the following actions,
which we think should be carried out to make the content of images in biomedical
articles more accessible: </p>
<p>• <bold>Stop pressing diagrams into bitmaps!</bold>
Unless a figure consists solely of a photograph, screenshot, or another kind
of picture that exists only as a bitmap, vector graphics should be used for
article figures.</p>
<p>• <bold>Let data and metadata travel from the tools that generate
diagrams to the final articles!</bold>
Whenever the specific tool that is used to
generate the diagram “knows” that a certain graphical element refers to
an organism, a gene, an interaction, a point in time, or another kind of entity,
then this information should be stored in the image file, passed on, and finally
published with the article.</p>
<p>• <bold>Use RDF vocabularies to embed semantic annotations in
diagrams!</bold>
Tools for creating scientific diagrams should use RDF notation and
stick to existing standardized schemas (or define new ones if required) to annotate
the diagram files they create.</p>
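<p>As a purely hypothetical illustration of this postulate, an SVG gel diagram
could carry RDF annotations in its metadata element. The vocabulary namespace
below is invented for the example; only the RDF namespace and the
identifiers.org gene URI pattern are standard.</p>

```xml
<!-- Hypothetical sketch: an SVG gel lane annotated with RDF. The
     "gel-vocab" namespace is illustrative, not a standardized schema. -->
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ex="http://example.org/gel-vocab#">
      <rdf:Description rdf:about="#lane1">
        <ex:probesGene rdf:resource="http://identifiers.org/ncbigene/7157"/>
        <ex:condition>wild type</ex:condition>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <rect id="lane1" x="10" y="10" width="30" height="80"/>
</svg>
```

<p>A mining pipeline could then read the gene reference directly from the
file instead of reconstructing it from pixels and OCR.</p>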
<p>• <bold>Define standards for scientific diagrams!</bold>
In the spirit of
the Unified Modeling Language, the biomedical community should come up with
standards that define the appearance and meaning of different types of diagrams.</p>
<p>Obviously, different groups of people need to be involved in these actions, namely
article authors, journal editors, and tool developers. It is relatively inexpensive
to follow these postulates (though it might require some time), which in turn would
greatly improve data sharing, image mining, and scientific communication in general.
Standardized diagrams could be the long sought solution to the problem of how to let
authors publish computer-processable formal representations for (part of) their
results. This can build upon the efforts of establishing an open annotation model [<xref ref-type="bibr" rid="B35">35</xref>
,<xref ref-type="bibr" rid="B36">36</xref>
].</p>
</sec>
<sec sec-type="conclusions"><title>Conclusions</title>
<p>Successful image mining from gel diagrams in biomedical publications would unlock a
large amount of valuable data. Our results show that gel panels and their labels can
be detected with high accuracy, applying machine learning techniques and hand-coded
rules. We also showed that genes and proteins can be detected in the gel labels with
satisfactory precision.</p>
<p>Based on these results, we believe that this kind of image mining is a promising and
viable approach to provide more powerful query interfaces for researchers, to gather
relations such as protein-protein interactions, and to generally complement existing
text mining approaches. At the same time, we believe that an effort towards
standardization of scientific diagrams such as gel images would greatly improve the
efficiency and precision of image mining at relatively low additional costs at the
time of publication.</p>
</sec>
<sec><title>Endnotes</title>
<p><sup>a</sup>
Because many figures consist of multiple panels of
different types, we go straight to gel segment detection without first classifying
entire images. Most gel panels share their figure with other panels, which makes
automated classification difficult at the image level.</p>
<p><sup>b</sup>
We double-checked these manual annotations to verify their quality,
which revealed only four misclassified segments in total for the training and test
samples (0.016% of all segments).</p>
<p><sup>c</sup>
We are using absolute distance values at this point. A more refined
algorithm could apply some sort of relative measure. However, the resolution of the
images does not vary that much, which is why absolute values worked out well so
far.</p>
<p><sup>d</sup>
Again, these manual annotations were double-checked to ensure their
quality. Five errors were found and fixed in this process.</p>
<p><sup>e</sup>
The low numbers are partially due to the fact that a considerable part
of the tokens are “junk tokens” produced by the OCR step when trying to
recognize characters in segments that do not contain text.</p>
<p><sup>f</sup>
These words are: <italic>min</italic>
, <italic>hrs</italic>
, <italic>line</italic>
,
<italic>type</italic>
, <italic>protein</italic>
, <italic>DNA</italic>
, <italic>RNA</italic>
, <italic>mRNA</italic>
,
<italic>membrane</italic>
, <italic>gel</italic>
, <italic>fold</italic>
, <italic>fragment</italic>
,
<italic>antigen</italic>
, <italic>enzyme</italic>
, <italic>kinase</italic>
, <italic>cleavage</italic>
,
<italic>factor</italic>
, <italic>blot</italic>
, <italic>pro</italic>
, <italic>pre</italic>
, <italic>peptide</italic>
,
and <italic>cell</italic>
.</p>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><title>Authors’ contributions</title>
<p>TK was the main author and main contributor of the presented work. He was responsible
for designing and implementing the pipeline, gathering the data, performing the
evaluation, and analyzing the results. MLN applied, trained, and evaluated the
ConvNet classifier, and contributed to the annotation of the test sets. TL built and
analyzed the corpus for the preparatory evaluation. MK contributed to the conception
and the design of the approach and to the analysis of the results. All authors have
been involved in drafting or revising the manuscript, and all authors read and
approved the final manuscript.</p>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>This study has been supported by the National Library of Medicine grant
5R01LM009956.</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="journal"><name><surname>Yu</surname>
<given-names>H</given-names>
</name>
<name><surname>Lee</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Accessing bioscience images from abstract sentences</bold>
</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>14</issue>
<fpage>e547</fpage>
<lpage>e556</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl261</pub-id>
<pub-id pub-id-type="pmid">16873519</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Zweigenbaum</surname>
<given-names>P</given-names>
</name>
<name><surname>Demner-Fushman</surname>
<given-names>D</given-names>
</name>
<name><surname>Yu</surname>
<given-names>H</given-names>
</name>
<name><surname>Cohen</surname>
<given-names>KB</given-names>
</name>
<article-title><bold>Frontiers of biomedical text mining: current progress</bold>
</article-title>
<source>Brief Bioinform</source>
<year>2007</year>
<volume>8</volume>
<issue>5</issue>
<fpage>358</fpage>
<lpage>375</lpage>
<pub-id pub-id-type="doi">10.1093/bib/bbm045</pub-id>
<pub-id pub-id-type="pmid">17977867</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><name><surname>Peng</surname>
<given-names>H</given-names>
</name>
<article-title><bold>Bioimage informatics: a new area of engineering biology</bold>
</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<issue>17</issue>
<fpage>1827</fpage>
<lpage>1836</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btn346">http://dx.doi.org/10.1093/bioinformatics/btn346</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn346</pub-id>
<pub-id pub-id-type="pmid">18603566</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>Xu</surname>
<given-names>S</given-names>
</name>
<name><surname>McCusker</surname>
<given-names>J</given-names>
</name>
<name><surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Yale Image Finder (YIF): a new search engine for retrieving biomedical
images</bold>
</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<issue>17</issue>
<fpage>1968</fpage>
<lpage>1970</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btn340">http://dx.doi.org/10.1093/bioinformatics/btn340</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn340</pub-id>
<pub-id pub-id-type="pmid">18614584</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Hearst</surname>
<given-names>MA</given-names>
</name>
<name><surname>Divoli</surname>
<given-names>A</given-names>
</name>
<name><surname>Guturu</surname>
<given-names>H</given-names>
</name>
<name><surname>Ksikes</surname>
<given-names>A</given-names>
</name>
<name><surname>Nakov</surname>
<given-names>P</given-names>
</name>
<name><surname>Wooldridge</surname>
<given-names>MA</given-names>
</name>
<name><surname>Ye</surname>
<given-names>J</given-names>
</name>
<article-title><bold>BioText search engine</bold>
</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>16</issue>
<fpage>2196</fpage>
<lpage>2197</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btm301">http://dx.doi.org/10.1093/bioinformatics/btm301</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm301</pub-id>
<pub-id pub-id-type="pmid">17545178</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="book"><name><surname>Kuhn</surname>
<given-names>T</given-names>
</name>
<name><surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Image Mining from Gel Diagrams in Biomedical Publications</bold>
</article-title>
<source>Proceedings of the 5th International Symposium on Semantic Mining in
Biomedicine (SMBM 2012)</source>
<year>2012</year>
<publisher-name>Zurich, Switzerland: University of Zurich</publisher-name>
<fpage>26</fpage>
<lpage>33</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://www.zora.uzh.ch/64476/">http://www.zora.uzh.ch/64476/</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="book"><name><surname>Kuhn</surname>
<given-names>T</given-names>
</name>
<name><surname>Luong</surname>
<given-names>T</given-names>
</name>
<name><surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Finding and Accessing Diagrams in Biomedical Publications</bold>
</article-title>
<source>Proceedings of the American Medical Informatics Association (AMIA) 2012
Annual Symposium</source>
<year>2012</year>
<publisher-name>Bethesda, MD, USA: American Medical Informatics Association</publisher-name>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="journal"><name><surname>Southern</surname>
<given-names>E</given-names>
</name>
<article-title><bold>Detection of specific sequences among DNA fragments separated by gel
electrophoresis</bold>
</article-title>
<source>J Mol Biol</source>
<year>1975</year>
<volume>98</volume>
<issue>3</issue>
<fpage>503</fpage>
<lpage>517</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(75)80083-0</pub-id>
<pub-id pub-id-type="pmid">1195397</pub-id>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="journal"><name><surname>Alwine</surname>
<given-names>JC</given-names>
</name>
<name><surname>Kemp</surname>
<given-names>DJ</given-names>
</name>
<name><surname>Stark</surname>
<given-names>GR</given-names>
</name>
<article-title><bold>Method for detection of specific RNAs in agarose gels by transfer to
diazobenzyloxymethyl-paper and hybridization with DNA probes</bold>
</article-title>
<source>Proc Natl Acad Sci</source>
<year>1977</year>
<volume>74</volume>
<issue>12</issue>
<fpage>5350</fpage>
<pub-id pub-id-type="doi">10.1073/pnas.74.12.5350</pub-id>
<pub-id pub-id-type="pmid">414220</pub-id>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="journal"><name><surname>Burnette</surname>
<given-names>WN</given-names>
</name>
<article-title><bold>Western blotting: Electrophoretic transfer of proteins from sodium
dodecyl sulfate-polyacrylamide gels to unmodified nitrocellulose and
radiographic detection with antibody and radioiodinated protein A</bold>
</article-title>
<source>Anal Biochem</source>
<year>1981</year>
<volume>112</volume>
<fpage>195</fpage>
<lpage>203</lpage>
<pub-id pub-id-type="doi">10.1016/0003-2697(81)90281-5</pub-id>
<pub-id pub-id-type="pmid">6266278</pub-id>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><name><surname>De Bruijn</surname>
<given-names>B</given-names>
</name>
<name><surname>Martin</surname>
<given-names>J</given-names>
</name>
<article-title><bold>Getting to the (c)ore of knowledge: mining biomedical literature</bold>
</article-title>
<source>Int J Med Inform</source>
<year>2002</year>
<volume>67</volume>
<issue>1–3</issue>
<fpage>7</fpage>
<lpage>18</lpage>
<pub-id pub-id-type="pmid">12460628</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="book"><name><surname>Murphy</surname>
<given-names>RF</given-names>
</name>
<name><surname>Kou</surname>
<given-names>Z</given-names>
</name>
<name><surname>Hua</surname>
<given-names>J</given-names>
</name>
<name><surname>Joffe</surname>
<given-names>M</given-names>
</name>
<name><surname>Cohen</surname>
<given-names>WW</given-names>
</name>
<article-title><bold>Extracting and structuring subcellular location information from on-line
journal articles: The subcellular location image finder</bold>
</article-title>
<source>Proceedings of the IASTED International Conference on Knowledge Sharing and
Collaborative Engineering (KSCE-2004)</source>
<year>2004</year>
<publisher-name>Calgary, AB, Canada: ACTA Press</publisher-name>
<fpage>109</fpage>
<lpage>114</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://www.actapress.com/Abstract.aspx?paperId=17244">http://www.actapress.com/Abstract.aspx?paperId=17244</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="journal"><name><surname>Qian</surname>
<given-names>Y</given-names>
</name>
<name><surname>Murphy</surname>
<given-names>RF</given-names>
</name>
<article-title><bold>Improved recognition of figures containing fluorescence microscope images
in online journal articles using graphical models</bold>
</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<issue>4</issue>
<fpage>569</fpage>
<lpage>576</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btm561">http://dx.doi.org/10.1093/bioinformatics/btm561</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm561</pub-id>
<pub-id pub-id-type="pmid">18033795</pub-id>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><name><surname>Kozhenkov</surname>
<given-names>S</given-names>
</name>
<name><surname>Baitaluk</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Mining and integration of pathway diagrams from imaging data</bold>
</article-title>
<source>Bioinformatics</source>
<year>2012</year>
<volume>28</volume>
<issue>5</issue>
<fpage>739</fpage>
<lpage>742</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/bts018">http://dx.doi.org/10.1093/bioinformatics/bts018</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bts018</pub-id>
<pub-id pub-id-type="pmid">22267504</pub-id>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="journal"><name><surname>Lemkin</surname>
<given-names>PF</given-names>
</name>
<article-title><bold>Comparing two-dimensional electrophoretic gel images across the
Internet</bold>
</article-title>
<source>Electrophoresis</source>
<year>1997</year>
<volume>18</volume>
<issue>3–4</issue>
<fpage>461</fpage>
<lpage>470</lpage>
<pub-id pub-id-type="pmid">9150925</pub-id>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="journal"><name><surname>Luhn</surname>
<given-names>S</given-names>
</name>
<name><surname>Berth</surname>
<given-names>M</given-names>
</name>
<name><surname>Hecker</surname>
<given-names>M</given-names>
</name>
<name><surname>Bernhardt</surname>
<given-names>J</given-names>
</name>
<article-title><bold>Using standard positions and image fusion to create proteome maps from
collections of two-dimensional gel electrophoresis images</bold>
</article-title>
<source>Proteomics</source>
<year>2003</year>
<volume>3</volume>
<issue>7</issue>
<fpage>1117</fpage>
<lpage>1127</lpage>
<pub-id pub-id-type="doi">10.1002/pmic.200300433</pub-id>
<pub-id pub-id-type="pmid">12872213</pub-id>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="journal"><name><surname>Cutler</surname>
<given-names>P</given-names>
</name>
<name><surname>Heald</surname>
<given-names>G</given-names>
</name>
<name><surname>White</surname>
<given-names>IR</given-names>
</name>
<name><surname>Ruan</surname>
<given-names>J</given-names>
</name>
<article-title><bold>A novel approach to spot detection for two-dimensional gel
electrophoresis images using pixel value collection</bold>
</article-title>
<source>Proteomics</source>
<year>2003</year>
<volume>3</volume>
<issue>4</issue>
<fpage>392</fpage>
<lpage>401</lpage>
<pub-id pub-id-type="doi">10.1002/pmic.200390054</pub-id>
<pub-id pub-id-type="pmid">12687607</pub-id>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><name><surname>Rogers</surname>
<given-names>M</given-names>
</name>
<name><surname>Graham</surname>
<given-names>J</given-names>
</name>
<name><surname>Tonge</surname>
<given-names>RP</given-names>
</name>
<article-title><bold>Statistical models of shape for the analysis of protein spots in
two-dimensional electrophoresis gel images</bold>
</article-title>
<source>Proteomics</source>
<year>2003</year>
<volume>3</volume>
<issue>6</issue>
<fpage>887</fpage>
<lpage>896</lpage>
<pub-id pub-id-type="doi">10.1002/pmic.200300421</pub-id>
<pub-id pub-id-type="pmid">12833512</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="journal"><name><surname>Zerr</surname>
<given-names>T</given-names>
</name>
<name><surname>Henikoff</surname>
<given-names>S</given-names>
</name>
<article-title><bold>Automated band mapping in electrophoretic gel images using background
information</bold>
</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>33</volume>
<issue>9</issue>
<fpage>2806</fpage>
<lpage>2812</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gki580</pub-id>
<pub-id pub-id-type="pmid">15894797</pub-id>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><name><surname>Schlamp</surname>
<given-names>K</given-names>
</name>
<name><surname>Weinmann</surname>
<given-names>A</given-names>
</name>
<name><surname>Krupp</surname>
<given-names>M</given-names>
</name>
<name><surname>Maass</surname>
<given-names>T</given-names>
</name>
<name><surname>Galle</surname>
<given-names>P</given-names>
</name>
<name><surname>Teufel</surname>
<given-names>A</given-names>
</name>
<article-title><bold>BlotBase: a northern blot database</bold>
</article-title>
<source>Gene</source>
<year>2008</year>
<volume>427</volume>
<issue>1–2</issue>
<fpage>47</fpage>
<lpage>50</lpage>
<pub-id pub-id-type="pmid">18838116</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="journal"><name><surname>Rodriguez-Esteban</surname>
<given-names>R</given-names>
</name>
<name><surname>Iossifov</surname>
<given-names>I</given-names>
</name>
<article-title><bold>Figure mining for biomedical research</bold>
</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>16</issue>
<fpage>2082</fpage>
<lpage>2084</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btp318">http://dx.doi.org/10.1093/bioinformatics/btp318</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp318</pub-id>
<pub-id pub-id-type="pmid">19439564</pub-id>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="journal"><name><surname>Ramakrishnan</surname>
<given-names>C</given-names>
</name>
<name><surname>Patnia</surname>
<given-names>A</given-names>
</name>
<name><surname>Hovy</surname>
<given-names>EH</given-names>
</name>
<name><surname>Burns</surname>
<given-names>GAPC</given-names>
</name>
<article-title><bold>Layout-aware text extraction from full-text PDF of scientific
articles</bold>
</article-title>
<source>Source Code Biol Med</source>
<year>2012</year>
<volume>7</volume>
<issue>1</issue>
<fpage>255</fpage>
<lpage>258</lpage>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="journal"><name><surname>Xu</surname>
<given-names>S</given-names>
</name>
<name><surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<article-title><bold>A new pivoting and iterative text detection algorithm for biomedical
images</bold>
</article-title>
<source>J Biomed Inform</source>
<year>2010</year>
<volume>43</volume>
<issue>6</issue>
<fpage>924</fpage>
<lpage>931</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1016/j.jbi.2010.09.006">http://dx.doi.org/10.1016/j.jbi.2010.09.006</ext-link>
]</comment>
<pub-id pub-id-type="doi">10.1016/j.jbi.2010.09.006</pub-id>
<pub-id pub-id-type="pmid">20887803</pub-id>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="book"><name><surname>Xu</surname>
<given-names>S</given-names>
</name>
<name><surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Boosting text extraction from biomedical images using text region
detection</bold>
</article-title>
<source>Biomedical Sciences and Engineering Conference (BSEC), 2011</source>
<year>2011</year>
<publisher-name>New York City, NY, USA: IEEE</publisher-name>
<fpage>1</fpage>
<lpage>4</lpage>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><name><surname>Haralick</surname>
<given-names>RM</given-names>
</name>
<name><surname>Shanmugam</surname>
<given-names>K</given-names>
</name>
<name><surname>Dinstein</surname>
<given-names>I</given-names>
</name>
<article-title><bold>Textural features for image classification</bold>
</article-title>
<source>IEEE Trans Syst Man Cybernet</source>
<year>1973</year>
<volume>3</volume>
<issue>6</issue>
<fpage>610</fpage>
<lpage>621</lpage>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1109/TSMC.1973.4309314">http://dx.doi.org/10.1109/TSMC.1973.4309314</ext-link>
]</comment>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><name><surname>Cooper</surname>
<given-names>GF</given-names>
</name>
<name><surname>Herskovits</surname>
<given-names>E</given-names>
</name>
<article-title><bold>A Bayesian method for the induction of probabilistic networks from
data</bold>
</article-title>
<source>Mach Learn</source>
<year>1992</year>
<volume>9</volume>
<issue>4</issue>
<fpage>309</fpage>
<lpage>347</lpage>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="book"><name><surname>Frank</surname>
<given-names>E</given-names>
</name>
<name><surname>Witten</surname>
<given-names>IH</given-names>
</name>
<article-title><bold>Generating accurate rule sets without global optimization</bold>
</article-title>
<source>Proceedings of the Fifteenth International Conference on Machine
Learning</source>
<year>1998</year>
<publisher-name>Burlington, MA, USA: Morgan Kaufmann Publishers</publisher-name>
<fpage>144</fpage>
<lpage>151</lpage>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="other"><name><surname>LeCun</surname>
<given-names>Y</given-names>
</name>
<name><surname>Bengio</surname>
<given-names>Y</given-names>
</name>
<article-title><bold>Convolutional networks for images, speech, and time series</bold>
</article-title>
<comment>Cambridge, MA, USA: MIT Press; 1995</comment>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><name><surname>Bradley</surname>
<given-names>AP</given-names>
</name>
<article-title><bold>The use of the area under the ROC curve in the evaluation of machine
learning algorithms</bold>
</article-title>
<source>Pattern Recognit</source>
<year>1997</year>
<volume>30</volume>
<issue>7</issue>
<fpage>1145</fpage>
<lpage>1159</lpage>
<pub-id pub-id-type="doi">10.1016/S0031-3203(96)00142-2</pub-id>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="journal"><name><surname>Tanabe</surname>
<given-names>L</given-names>
</name>
<name><surname>Wilbur</surname>
<given-names>WJ</given-names>
</name>
<article-title><bold>Tagging gene and protein names in biomedical text</bold>
</article-title>
<source>Bioinformatics</source>
<year>2002</year>
<volume>18</volume>
<issue>8</issue>
<fpage>1124</fpage>
<lpage>1132</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/18.8.1124</pub-id>
<pub-id pub-id-type="pmid">12176836</pub-id>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="journal"><name><surname>Lu</surname>
<given-names>Z</given-names>
</name>
<name><surname>Kao</surname>
<given-names>HY</given-names>
</name>
<name><surname>Wei</surname>
<given-names>CH</given-names>
</name>
<name><surname>Huang</surname>
<given-names>M</given-names>
</name>
<name><surname>Liu</surname>
<given-names>J</given-names>
</name>
<name><surname>Kuo</surname>
<given-names>CJ</given-names>
</name>
<name><surname>Hsu</surname>
<given-names>CN</given-names>
</name>
<name><surname>Tsai</surname>
<given-names>R</given-names>
</name>
<name><surname>Dai</surname>
<given-names>HJ</given-names>
</name>
<name><surname>Okazaki</surname>
<given-names>N</given-names>
</name>
<name><surname>Cho</surname>
<given-names>HC</given-names>
</name>
<name><surname>Gerner</surname>
<given-names>M</given-names>
</name>
<name><surname>Solt</surname>
<given-names>I</given-names>
</name>
<name><surname>Agarwal</surname>
<given-names>S</given-names>
</name>
<name><surname>Liu</surname>
<given-names>F</given-names>
</name>
<name><surname>Vishnyakova</surname>
<given-names>D</given-names>
</name>
<name><surname>Ruch</surname>
<given-names>P</given-names>
</name>
<name><surname>Romacker</surname>
<given-names>M</given-names>
</name>
<name><surname>Rinaldi</surname>
<given-names>F</given-names>
</name>
<name><surname>Bhattacharya</surname>
<given-names>S</given-names>
</name>
<name><surname>Srinivasan</surname>
<given-names>P</given-names>
</name>
<name><surname>Liu</surname>
<given-names>H</given-names>
</name>
<name><surname>Torii</surname>
<given-names>M</given-names>
</name>
<name><surname>Matos</surname>
<given-names>S</given-names>
</name>
<name><surname>Campos</surname>
<given-names>D</given-names>
</name>
<name><surname>Verspoor</surname>
<given-names>K</given-names>
</name>
<name><surname>Livingston</surname>
<given-names>KM</given-names>
</name>
<name><surname>Wilbur</surname>
<given-names>WJ</given-names>
</name>
<article-title><bold>The gene normalization task in BioCreative III</bold>
</article-title>
<source>BMC Bioinform</source>
<year>2011</year>
<volume>12</volume>
<issue>Suppl 8</issue>
<fpage>S2</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-12-S8-S2</pub-id>
</mixed-citation>
</ref>
<ref id="B32"><mixed-citation publication-type="book"><name><surname>Barbano</surname>
<given-names>PE</given-names>
</name>
<name><surname>Nagy</surname>
<given-names>ML</given-names>
</name>
<name><surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<article-title><bold>Energy-based architecture for classification of publication figures</bold>
</article-title>
<source>Proceedings of the Biomedical Science and Engineering Center Conference
(BSEC 2013)</source>
<year>2013</year>
<publisher-name>New York City, NY, USA: IEEE</publisher-name>
</mixed-citation>
</ref>
<ref id="B33"><mixed-citation publication-type="book"><name><surname>Sermanet</surname>
<given-names>P</given-names>
</name>
<name><surname>Kavukcuoglu</surname>
<given-names>K</given-names>
</name>
<name><surname>LeCun</surname>
<given-names>Y</given-names>
</name>
<article-title><bold>EBlearn: Open-source energy-based learning in C++</bold>
</article-title>
<source>Proceedings of the 21st International Conference on Tools with Artificial
Intelligence (ICTAI’09)</source>
<year>2009</year>
<publisher-name>New York City, NY, USA: IEEE</publisher-name>
<fpage>693</fpage>
<lpage>697</lpage>
</mixed-citation>
</ref>
<ref id="B34"><mixed-citation publication-type="journal"><name><surname>Krizhevsky</surname>
<given-names>A</given-names>
</name>
<name><surname>Sutskever</surname>
<given-names>I</given-names>
</name>
<name><surname>Hinton</surname>
<given-names>G</given-names>
</name>
<article-title><bold>ImageNet classification with deep convolutional neural networks</bold>
</article-title>
<source>Adv Neural Inform Process Syst</source>
<year>2012</year>
<volume>25</volume>
<fpage>1106</fpage>
<lpage>1114</lpage>
</mixed-citation>
</ref>
<ref id="B35"><mixed-citation publication-type="journal"><name><surname>Ciccarese</surname>
<given-names>P</given-names>
</name>
<name><surname>Ocana</surname>
<given-names>M</given-names>
</name>
<name><surname>Garcia Castro</surname>
<given-names>LJ</given-names>
</name>
<name><surname>Das</surname>
<given-names>S</given-names>
</name>
<name><surname>Clark</surname>
<given-names>T</given-names>
</name>
<article-title><bold>An open annotation ontology for science on web 3.0</bold>
</article-title>
<source>J Biomed Semantics</source>
<year>2011</year>
<volume>2</volume>
<issue>Suppl 2</issue>
<fpage>S4</fpage>
<pub-id pub-id-type="doi">10.1186/2041-1480-2-S2-S4</pub-id>
<pub-id pub-id-type="pmid">21624159</pub-id>
</mixed-citation>
</ref>
<ref id="B36"><mixed-citation publication-type="other"><name><surname>Sanderson</surname>
<given-names>R</given-names>
</name>
<name><surname>Ciccarese</surname>
<given-names>P</given-names>
</name>
<name><surname>Van de Sompel</surname>
<given-names>H</given-names>
</name>
<article-title><bold>Open annotation data model</bold>
</article-title>
<source>Community draft W3C</source>
<year>2013</year>
<comment>[<ext-link ext-link-type="uri" xlink:href="http://www.openannotation.org/spec/core/20130208/index.html">http://www.openannotation.org/spec/core/20130208/index.html</ext-link>
]</comment>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000109 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000109 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Pmc |étape= Curation |type= RBID |clé= PMC:4190668 |texte= Mining images in biomedical publications: Detection and analysis of gel diagrams }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i -Sk "pubmed:24568573" \
  | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd \
  | NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.