OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

ChemEx: information extraction system for chemical data curation

Internal identifier: 000098 (Pmc/Curation); previous: 000097; next: 000099

Authors: Atima Tharatipyakul [Thailand]; Somrak Numnark [Thailand]; Duangdao Wichadakul [Thailand]; Supawadee Ingsriswang [Thailand]

Source:

RBID : PMC:3521388

Abstract

Background

Manual curation of chemical data from publications is error-prone and time-consuming, and it makes up-to-date data sets hard to maintain. Automatic information extraction can be used as a tool to reduce these problems. Since chemical structures are usually described in images, information extraction needs to combine structure image recognition with text mining.

Results

We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. The text annotator extracts compound, organism, and assay entities from text content, while structure image recognition translates chemical raster images into a machine-readable format. A user can view the annotated text along with summarized information on compounds, the organisms that produce those compounds, and assay tests.

Conclusions

ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from http://www.biotec.or.th/isl/ChemEx.


Url:
DOI: 10.1186/1471-2105-13-S17-S9
PubMed: 23282330
PubMed Central: 3521388

Links toward previous steps (curation, corpus...)


Links to Exploration step

PMC:3521388

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">ChemEx: information extraction system for chemical data curation</title>
<author>
<name sortKey="Tharatipyakul, Atima" sort="Tharatipyakul, Atima" uniqKey="Tharatipyakul A" first="Atima" last="Tharatipyakul">Atima Tharatipyakul</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Numnark, Somrak" sort="Numnark, Somrak" uniqKey="Numnark S" first="Somrak" last="Numnark">Somrak Numnark</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wichadakul, Duangdao" sort="Wichadakul, Duangdao" uniqKey="Wichadakul D" first="Duangdao" last="Wichadakul">Duangdao Wichadakul</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Ingsriswang, Supawadee" sort="Ingsriswang, Supawadee" uniqKey="Ingsriswang S" first="Supawadee" last="Ingsriswang">Supawadee Ingsriswang</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23282330</idno>
<idno type="pmc">3521388</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521388</idno>
<idno type="RBID">PMC:3521388</idno>
<idno type="doi">10.1186/1471-2105-13-S17-S9</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000098</idno>
<idno type="wicri:Area/Pmc/Curation">000098</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">ChemEx: information extraction system for chemical data curation</title>
<author>
<name sortKey="Tharatipyakul, Atima" sort="Tharatipyakul, Atima" uniqKey="Tharatipyakul A" first="Atima" last="Tharatipyakul">Atima Tharatipyakul</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Numnark, Somrak" sort="Numnark, Somrak" uniqKey="Numnark S" first="Somrak" last="Numnark">Somrak Numnark</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Wichadakul, Duangdao" sort="Wichadakul, Duangdao" uniqKey="Wichadakul D" first="Duangdao" last="Wichadakul">Duangdao Wichadakul</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Ingsriswang, Supawadee" sort="Ingsriswang, Supawadee" uniqKey="Ingsriswang S" first="Supawadee" last="Ingsriswang">Supawadee Ingsriswang</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</nlm:aff>
<country xml:lang="fr">Thaïlande</country>
<wicri:regionArea>Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Manual curation of chemical data from publications is error-prone and time-consuming, and it makes up-to-date data sets hard to maintain. Automatic information extraction can be used as a tool to reduce these problems. Since chemical structures are usually described in images, information extraction needs to combine structure image recognition with text mining.</p>
</sec>
<sec>
<title>Results</title>
<p>We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. The text annotator extracts compound, organism, and assay entities from text content, while structure image recognition translates chemical raster images into a machine-readable format. A user can view the annotated text along with summarized information on compounds, the organisms that produce those compounds, and assay tests.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://www.biotec.or.th/isl/ChemEx">http://www.biotec.or.th/isl/ChemEx</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bolton, Evan E" uniqKey="Bolton E">Evan E Bolton</name>
</author>
<author>
<name sortKey="Wang, Yanli" uniqKey="Wang Y">Yanli Wang</name>
</author>
<author>
<name sortKey="Thiessen, Paul A" uniqKey="Thiessen P">Paul A Thiessen</name>
</author>
<author>
<name sortKey="Bryant, Stephen H" uniqKey="Bryant S">Stephen H Bryant</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hanisch, D" uniqKey="Hanisch D">D Hanisch</name>
</author>
<author>
<name sortKey="Fundel, K" uniqKey="Fundel K">K Fundel</name>
</author>
<author>
<name sortKey="Mevissen, H T" uniqKey="Mevissen H">H-T Mevissen</name>
</author>
<author>
<name sortKey="Zimmer, R" uniqKey="Zimmer R">R Zimmer</name>
</author>
<author>
<name sortKey="Fluck, J" uniqKey="Fluck J">J Fluck</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cohen, Am" uniqKey="Cohen A">AM Cohen</name>
</author>
<author>
<name sortKey="Hersh, Wr" uniqKey="Hersh W">WR Hersh</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krallinger, M" uniqKey="Krallinger M">M Krallinger</name>
</author>
<author>
<name sortKey="Leitner, F" uniqKey="Leitner F">F Leitner</name>
</author>
<author>
<name sortKey="Rodriguez Penagos, C" uniqKey="Rodriguez Penagos C">C Rodriguez-Penagos</name>
</author>
<author>
<name sortKey="Valencia, A" uniqKey="Valencia A">A Valencia</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mcdaniel, Jr" uniqKey="Mcdaniel J">JR McDaniel</name>
</author>
<author>
<name sortKey="Balmuth, Jr" uniqKey="Balmuth J">JR Balmuth</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ibison, P" uniqKey="Ibison P">P Ibison</name>
</author>
<author>
<name sortKey="Jacquot, M" uniqKey="Jacquot M">M Jacquot</name>
</author>
<author>
<name sortKey="Kam, F" uniqKey="Kam F">F Kam</name>
</author>
<author>
<name sortKey="Neville, Ag" uniqKey="Neville A">AG Neville</name>
</author>
<author>
<name sortKey="Simpson, Rw" uniqKey="Simpson R">RW Simpson</name>
</author>
<author>
<name sortKey="Tonnelier, C" uniqKey="Tonnelier C">C Tonnelier</name>
</author>
<author>
<name sortKey="Venczel, T" uniqKey="Venczel T">T Venczel</name>
</author>
<author>
<name sortKey="Johnson, Ap" uniqKey="Johnson A">AP Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Valko, At" uniqKey="Valko A">AT Valko</name>
</author>
<author>
<name sortKey="Johnson, Ap" uniqKey="Johnson A">AP Johnson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Algorri, M E" uniqKey="Algorri M">M-E Algorri</name>
</author>
<author>
<name sortKey="Zimmermann, M" uniqKey="Zimmermann M">M Zimmermann</name>
</author>
<author>
<name sortKey="Friedrich, Cm" uniqKey="Friedrich C">CM Friedrich</name>
</author>
<author>
<name sortKey="Akle, S" uniqKey="Akle S">S Akle</name>
</author>
<author>
<name sortKey="Hofmann Apitius, M" uniqKey="Hofmann Apitius M">M Hofmann-Apitius</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Filippov, Iv" uniqKey="Filippov I">IV Filippov</name>
</author>
<author>
<name sortKey="Nicklaus, Mc" uniqKey="Nicklaus M">MC Nicklaus</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Park, J" uniqKey="Park J">J Park</name>
</author>
<author>
<name sortKey="Rosania, Gr" uniqKey="Rosania G">GR Rosania</name>
</author>
<author>
<name sortKey="Shedden, Ka" uniqKey="Shedden K">KA Shedden</name>
</author>
<author>
<name sortKey="Nguyen, M" uniqKey="Nguyen M">M Nguyen</name>
</author>
<author>
<name sortKey="Lyu, N" uniqKey="Lyu N">N Lyu</name>
</author>
<author>
<name sortKey="Saitou, K" uniqKey="Saitou K">K Saitou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Klinger, R" uniqKey="Klinger R">R Klinger</name>
</author>
<author>
<name sortKey="Kola Ik, C" uniqKey="Kola Ik C">C Kolářik</name>
</author>
<author>
<name sortKey="Fluck, J" uniqKey="Fluck J">J Fluck</name>
</author>
<author>
<name sortKey="Hofmann Apitius, M" uniqKey="Hofmann Apitius M">M Hofmann-Apitius</name>
</author>
<author>
<name sortKey="Friedrich, Cm" uniqKey="Friedrich C">CM Friedrich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, B" uniqKey="Sun B">B Sun</name>
</author>
<author>
<name sortKey="Tan, Q" uniqKey="Tan Q">Q Tan</name>
</author>
<author>
<name sortKey="Mitra, P" uniqKey="Mitra P">P Mitra</name>
</author>
<author>
<name sortKey="Giles, Cl" uniqKey="Giles C">CL Giles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hamon, T" uniqKey="Hamon T">T Hamon</name>
</author>
<author>
<name sortKey="Grabar, N" uniqKey="Grabar N">N Grabar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yan, S" uniqKey="Yan S">S Yan</name>
</author>
<author>
<name sortKey="Spangler, Ws" uniqKey="Spangler W">WS Spangler</name>
</author>
<author>
<name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Newman, Dj" uniqKey="Newman D">DJ Newman</name>
</author>
<author>
<name sortKey="Cragg, Gm" uniqKey="Cragg G">GM Cragg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jessop, D" uniqKey="Jessop D">D Jessop</name>
</author>
<author>
<name sortKey="Adams, S" uniqKey="Adams S">S Adams</name>
</author>
<author>
<name sortKey="Willighagen, E" uniqKey="Willighagen E">E Willighagen</name>
</author>
<author>
<name sortKey="Hawizy, L" uniqKey="Hawizy L">L Hawizy</name>
</author>
<author>
<name sortKey="Murray Rust, P" uniqKey="Murray Rust P">P Murray-Rust</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hawizy, L" uniqKey="Hawizy L">L Hawizy</name>
</author>
<author>
<name sortKey="Jessop, D" uniqKey="Jessop D">D Jessop</name>
</author>
<author>
<name sortKey="Adams, N" uniqKey="Adams N">N Adams</name>
</author>
<author>
<name sortKey="Murray Rust, P" uniqKey="Murray Rust P">P Murray-Rust</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Corbett, P" uniqKey="Corbett P">P Corbett</name>
</author>
<author>
<name sortKey="Copestake, A" uniqKey="Copestake A">A Copestake</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Degtyarenko, K" uniqKey="Degtyarenko K">K Degtyarenko</name>
</author>
<author>
<name sortKey="De Matos, P" uniqKey="De Matos P">P de Matos</name>
</author>
<author>
<name sortKey="Ennis, M" uniqKey="Ennis M">M Ennis</name>
</author>
<author>
<name sortKey="Hastings, J" uniqKey="Hastings J">J Hastings</name>
</author>
<author>
<name sortKey="Zbinden, M" uniqKey="Zbinden M">M Zbinden</name>
</author>
<author>
<name sortKey="Mcnaught, A" uniqKey="Mcnaught A">A McNaught</name>
</author>
<author>
<name sortKey="Alcantara, R" uniqKey="Alcantara R">R Alcantara</name>
</author>
<author>
<name sortKey="Darsow, M" uniqKey="Darsow M">M Darsow</name>
</author>
<author>
<name sortKey="Guedj, M" uniqKey="Guedj M">M Guedj</name>
</author>
<author>
<name sortKey="Ashburner, M" uniqKey="Ashburner M">M Ashburner</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ingsriswang, S" uniqKey="Ingsriswang S">S Ingsriswang</name>
</author>
<author>
<name sortKey="Pacharawongsakda, E" uniqKey="Pacharawongsakda E">E Pacharawongsakda</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23282330</article-id>
<article-id pub-id-type="pmc">3521388</article-id>
<article-id pub-id-type="publisher-id">1471-2105-13-S17-S9</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-13-S17-S9</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>ChemEx: information extraction system for chemical data curation</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" id="A1">
<name>
<surname>Tharatipyakul</surname>
<given-names>Atima</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>atima.tha@biotec.or.th</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Numnark</surname>
<given-names>Somrak</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>somrak.num@biotec.or.th</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Wichadakul</surname>
<given-names>Duangdao</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>duangdao.wic@biotec.or.th</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A4">
<name>
<surname>Ingsriswang</surname>
<given-names>Supawadee</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>supawadee@biotec.or.th</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Information Systems Laboratory, National Center for Genetic Engineering and Biotechnology (BIOTEC), 113 Thailand Science Park, Phaholyothin Road, Klong 1, Klong Luang, Pathumthani, Thailand</aff>
<pub-date pub-type="collection">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>7</day>
<month>12</month>
<year>2012</year>
</pub-date>
<volume>13</volume>
<issue>Suppl 17</issue>
<supplement>
<named-content content-type="supplement-title">Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics</named-content>
<named-content content-type="supplement-editor">Shoba Ranganathan, Christian Schönbach, Sissades Tongsima, Jonathan Chan and Tin Wee Tan</named-content>
<named-content content-type="supplement-sponsor">The articles in this supplement were supported by funding agencies as detailed in the Acknowledgement section of each article</named-content>
</supplement>
<fpage>S9</fpage>
<lpage>S9</lpage>
<permissions>
<copyright-statement>Copyright ©2012 Tharatipyakul et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2012</copyright-year>
<copyright-holder>Tharatipyakul et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/13/S17/S9"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Manual curation of chemical data from publications is error-prone and time-consuming, and it makes up-to-date data sets hard to maintain. Automatic information extraction can be used as a tool to reduce these problems. Since chemical structures are usually described in images, information extraction needs to combine structure image recognition with text mining.</p>
</sec>
<sec>
<title>Results</title>
<p>We have developed ChemEx, a chemical information extraction system. ChemEx processes both text and images in publications. The text annotator extracts compound, organism, and assay entities from text content, while structure image recognition translates chemical raster images into a machine-readable format. A user can view the annotated text along with summarized information on compounds, the organisms that produce those compounds, and assay tests.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>ChemEx facilitates and speeds up chemical data curation by extracting compounds, organisms, and assays from a large collection of publications. The software and corpus can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://www.biotec.or.th/isl/ChemEx">http://www.biotec.or.th/isl/ChemEx</ext-link>
.</p>
</sec>
</abstract>
<conference>
<conf-date>3-5 October 2012</conf-date>
<conf-name>Asia Pacific Bioinformatics Network (APBioNet) Eleventh International Conference on Bioinformatics (InCoB2012)</conf-name>
<conf-loc>Bangkok, Thailand</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Accurate chemical data curation is essential for cheminformatics. Nowadays, researchers or exploration software can access internal or external public databases [
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B2">2</xref>
] to retrieve the information they need. Still, the major source of knowledge is the scientific literature. Unfortunately, information in the literature is unstructured or semi-structured and written in natural language. Chemical structures are embedded in reports, journals, and patents in the form of images, which cannot be input directly into chemical databases or chemistry software. Manually reproducing the information is time-consuming and error-prone. Furthermore, the rapid growth of publications makes it difficult to keep data sets up to date. To overcome these problems, automatic information extraction has become a subject of interest.</p>
<p>Whereas there are numerous text-mining tools in the biological domain [
<xref ref-type="bibr" rid="B3">3</xref>
-
<xref ref-type="bibr" rid="B6">6</xref>
], chemical information extraction did not receive attention until recently. Existing techniques for chemical information extraction can be broadly classified into two categories: visual and textual data extraction. Visual data extraction systems, such as Kekulé [
<xref ref-type="bibr" rid="B7">7</xref>
], CLiDE [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
], chemOCR [
<xref ref-type="bibr" rid="B10">10</xref>
], OSRA [
<xref ref-type="bibr" rid="B11">11</xref>
], and ChemReader [
<xref ref-type="bibr" rid="B12">12</xref>
], focus on interpreting images embedded in documents, while textual data extraction focuses on mining entities of interest and their relations from text. Textual data extraction varies with the subject domain, such as chemical names [
<xref ref-type="bibr" rid="B13">13</xref>
,
<xref ref-type="bibr" rid="B14">14</xref>
], chemical formulae [
<xref ref-type="bibr" rid="B14">14</xref>
], or drug names [
<xref ref-type="bibr" rid="B15">15</xref>
]. Extracting information from image content alone or text content alone loses information, including the semantic links between text and images. Therefore, a technique for combining the two media [
<xref ref-type="bibr" rid="B16">16</xref>
] could be applied to improve knowledge discovery.</p>
<p>ChemEx is software developed to assist the chemical data curation process. While it can be used for general chemical information extraction, ChemEx is designed for extracting information about natural products, which are a major source of novel bioactive compounds or structures [
<xref ref-type="bibr" rid="B17">17</xref>
]. It provides a framework to integrate optical structure recognition and chemical text-mining software. The extracted information can then be visualized and exported to a database. Large chemical libraries can thus be built with minimal time and effort.</p>
</sec>
<sec>
<title>Implementation</title>
<sec>
<title>System overview</title>
<p>ChemEx processes a collection of publications to extract, from each publication, information about bioactive compounds, the organisms that produce those compounds, and their bioactivity, as illustrated in Figure
<xref ref-type="fig" rid="F1">1</xref>
. The system consists of four main modules: (a)
<italic>Document Preprocessor</italic>
, (b)
<italic>2D Chemical Structure Image Recognition</italic>
, (c)
<italic>Text Annotator</italic>
, and (d)
<italic>Information Viewer</italic>
.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Entities of interest and their relations</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-1"></graphic>
</fig>
<p>Figure
<xref ref-type="fig" rid="F2">2</xref>
presents the workflow of the system. First, the
<italic>Document Preprocessor </italic>
transforms and segments each input publication into textual and visual data. The 2D
<italic>Chemical Structure Image Recognition </italic>
module then translates the visual data (images) into machine-readable strings, whereas the
<italic>Text Annotator </italic>
module tags words that belong to the subject domain. Finally, a user can view the extracted information using the
<italic>Information Viewer</italic>
.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>System overview</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-2"></graphic>
</fig>
</sec>
<sec>
<title>Document preprocessor module</title>
<p>This module pre-processes publications so that they can be passed to the 2D structure image recognition and text annotator modules. ChemEx works with both electronically generated PDFs and scanned PDFs. Poppler [
<xref ref-type="bibr" rid="B18">18</xref>
] is used to segment a PDF file into a set of images and plain text. Converting full-text PDFs produced layout errors: headers and footers were mixed up with the content, and a paragraph was sometimes broken into multiple discontinuous paragraphs. Hence, if a bibliography file is available, the text content is extracted from the abstract field in BibTeX instead. For a scanned PDF, from which text cannot be extracted, a BibTeX file is required for the system to work properly.</p>
<p>It was observed that "-" is usually extracted as an unknown character "?". Therefore, for example, ChemEx replaces "Aigialomycins A?E (2?6)" with "Aigialomycins A-E (2-6)".</p>
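<p>The dash repair described above can be sketched as follows. This is a minimal illustration, not the ChemEx implementation: the function name repair_dashes and the rule that only a "?" sitting directly between two letters or digits counts as a lost dash are assumptions made for this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
import re

# Assumed heuristic (not taken from the paper): a "?" directly between two
# letters or digits is treated as a dash that was lost during PDF extraction.
# A "?" followed by a space or end of text is left alone as real punctuation.
_LOST_DASH = re.compile(r"(?<=[A-Za-z0-9])\?(?=[A-Za-z0-9])")


def repair_dashes(text: str) -> str:
    """Replace every suspected lost dash in `text` with "-"."""
    return _LOST_DASH.sub("-", text)


if __name__ == "__main__":
    print(repair_dashes("Aigialomycins A?E (2?6)"))
```
]]></preformat>
<p>Restricting the replacement to a "?" between alphanumeric characters keeps genuine question marks at the end of a sentence intact, at the cost of missing dashes that border brackets or spaces.</p>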
</sec>
<sec>
<title>2D chemical structure image recognition module</title>
<p>Structure images embedded in publications typically consist of two parts: the 2D structure of a chemical molecule and a label carrying an identifier used to refer to the structure later in the text. An overview of this module is shown in Figure
<xref ref-type="fig" rid="F3">3</xref>
. This module consists of the following three steps: (1)
<italic>Structure Recognition </italic>
which translates each 2D chemical structure image into a machine-readable format, (2)
<italic>Label Recognition </italic>
which identifies labels in a structure image, and (3)
<italic>Structure-Label Mapper </italic>
which constructs a mapping table between each label and the file location of the corresponding 2D structure.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>2D chemical structure image recognition overview</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-3"></graphic>
</fig>
<sec>
<title>Step 1: structure recognition</title>
<p>To retrieve machine-readable structures from 2D chemical structure images, ChemEx uses the open-source tool OSRA [
<xref ref-type="bibr" rid="B11">11</xref>
]. In this step, ChemEx recovers both SMILES [
<xref ref-type="bibr" rid="B19">19</xref>
] and MDL Molfile [
<xref ref-type="bibr" rid="B20">20</xref>
] from a 2D chemical structure image. Based on OSRA features, ChemEx recognizes atomic labels and charges, circle bonds (old-style aromatic rings), double and triple bonds, wedge and dash bonds, and bridge bonds.</p>
</sec>
<sec>
<title>Step 2: label recognition</title>
<p>ChemEx retrieves non-structure components of the 2D structure image to identify labels of the structure. There are two parts in this step:
<italic>Character Recognition </italic>
and
<italic>Pattern Recognition</italic>
.
<italic>Character Recognition </italic>
converts non-structure image components to text using GOCR [
<xref ref-type="bibr" rid="B21">21</xref>
]. If the text pattern matches chemical label features [
<xref ref-type="bibr" rid="B16">16</xref>
], that image component is identified as a label. ChemEx recognizes Roman numerals (e.g., I, VI, X), Arabic numerals (e.g., 1, 2, 10), digits connected by a dash (e.g., 1-1, 3-10), digits followed by a prime (e.g., 1', VI', 1-1'), and any of the previous features enclosed in parentheses (e.g., (1), (VI), (5')).</p>
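<p>The label features listed above can be expressed as one regular expression. The sketch below is illustrative only and is not taken from ChemEx or from the cited label-feature work. The name is_structure_label is hypothetical, and limiting Roman numerals to the letters I, V and X is an assumption based on the examples given.</p>
<preformat preformat-type="code"><![CDATA[
```python
import re

# One numeral: Arabic digits, or a Roman numeral limited to I, V and X
# (an assumption; the text only gives I, VI and X as examples).
_NUMERAL = r"(?:\d+|[IVX]+)"
# A numeral, optionally dash-joined to a second numeral, optionally primed.
_CORE = rf"{_NUMERAL}(?:-{_NUMERAL})?'?"
# Either the bare form or the same form enclosed in parentheses.
_LABEL = re.compile(rf"{_CORE}|\({_CORE}\)")


def is_structure_label(text: str) -> bool:
    """Return True if OCR output `text` looks like a compound label."""
    return _LABEL.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for candidate in ["VI", "3-10", "1-1'", "(5')", "OH", "CH3"]:
        print(candidate, is_structure_label(candidate))
```
]]></preformat>
<p>A pattern like this cannot tell the Roman numeral I from an iodine atom label, so in practice it would only be applied to image components already classified as non-structure.</p>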
</sec>
<sec>
<title>Step 3: structure-label mapper</title>
<p>One structure image may contain multiple labels. Also, a label may contain the identification number used to reference the structure as well as other text, for instance, a compound name or an R-group. To construct a structure-label mapping table, ChemEx's Structure-Label Mapper assigns each 2D structure (from step 1) to its nearest label (from step 2) using a minimum weight graph matching algorithm [
<xref ref-type="bibr" rid="B16">16</xref>
]. Each successful mapping is written to a file for use in the
<italic>Information Viewer</italic>
.</p>
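<p>To make the idea of minimum weight matching concrete, the sketch below pairs structures with labels so that the total distance is smallest. It is a simplified stand-in, not the algorithm of the cited work: it assumes each structure and label is reduced to a centre point, uses Euclidean distance as the weight, and finds the optimum by exhaustive search, which is only practical for the handful of items found in one image. All names and coordinates are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import permutations
from math import dist


def map_structures_to_labels(structures, labels):
    """Pair each structure with a distinct label at minimum total distance.

    `structures` and `labels` map a name to an (x, y) centre point.
    Exhaustive search over label orderings; fine for a few items per image.
    """
    s_names = list(structures)
    l_names = list(labels)
    if len(s_names) > len(l_names):
        raise ValueError("need at least as many labels as structures")

    best_cost, best_mapping = float("inf"), {}
    for chosen in permutations(l_names, len(s_names)):
        # Total weight of this one-to-one assignment.
        cost = sum(dist(structures[s], labels[l]) for s, l in zip(s_names, chosen))
        if cost < best_cost:
            best_cost, best_mapping = cost, dict(zip(s_names, chosen))
    return best_mapping


if __name__ == "__main__":
    structures = {"mol_a.mol": (0, 0), "mol_b.mol": (10, 0)}
    labels = {"1": (1, -3), "2": (9, -3), "R = OH": (30, 30)}
    print(map_structures_to_labels(structures, labels))
```
]]></preformat>
<p>Unlike picking the nearest label for each structure independently, a matching never assigns the same label to two structures. Extra labels, such as an R-group note, simply remain unassigned.</p>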
</sec>
</sec>
<sec>
<title>Text annotator module</title>
<p>This module discovers entities of interest and their relations in the text of publications. ChemEx employs a component called Analysis Engine (AE) from Unstructured Information Management Applications (UIMA) [
<xref ref-type="bibr" rid="B22">22</xref>
] to analyse documents in four steps: (1)
<italic>Tokenizer</italic>
, (2)
<italic>Tagger</italic>
, (3)
<italic>Phrase Parser and Identification</italic>
, and (4)
<italic>Coordination Resolution</italic>
. The processing flow among these steps is illustrated in Figure
<xref ref-type="fig" rid="F4">4</xref>
.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Text annotator workflow</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-4"></graphic>
</fig>
<sec>
<title>Step 1: tokenizer</title>
<p>The tokenizer splits a text stream into word tokens. ChemEx uses the tokenizer from OSCAR4 [
<xref ref-type="bibr" rid="B23">23</xref>
], which handles hyphens and other symbols in chemical terms such as 2-Amino-2-(hydroxymethyl)-1,3-propanediol hydrochloride. ChemEx also extends OSCAR's tokenizer to handle abbreviated scientific names, such as
<italic>Penicillium sp</italic>
. or
<italic>P. pacificum</italic>
.</p>
</sec>
<sec>
<title>Step 2: tagger</title>
<p>The tagger labels the word tokens of interest in the text. The ChemEx tagger consists of the
<italic>Chemical Entities Tagger</italic>
,
<italic>Organism Entities Tagger</italic>
, and
<italic>Assay Entities Tagger</italic>
.</p>
<p>ChemEx employs ChemicalTagger [
<xref ref-type="bibr" rid="B24">24</xref>
], which uses a machine learning approach, the Maximum Entropy Markov Model recogniser [
<xref ref-type="bibr" rid="B25">25</xref>
], to (i) recognize chemical names, reaction names, enzymes, and chemistry-related terms such as experimental action verbs or units and (ii) tag general English word classes, such as nouns or verbs, which are used by the phrase parser. ChemEx uses all information from ChemicalTagger.</p>
<p>Organism and assay entities are tagged using a dictionary-based approach. ChemEx extends ConceptMapper [
<xref ref-type="bibr" rid="B26">26</xref>
], a configurable, dictionary-based UIMA annotator. ConceptMapper allows a user to add or remove dictionaries according to the domain of interest. The extended ConceptMapper keeps an identification number and database source for each entry, so further information about the entities can be retrieved. For organisms, once scientific names, e.g.,
<italic>Escherichia coli</italic>
, are detected, the tagger abbreviates all scientific names and searches the text again for abbreviated scientific names such as
<italic>E. coli</italic>
. Also, the organism tagger extends a term to cover "sp.", "spec.", or "spp." for unspecified species.</p>
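<p>The second pass for abbreviated names can be sketched as follows. This is an illustrative Python sketch, not the ConceptMapper extension itself; the function names and the return format are assumptions, and only two-word (genus plus species) names are abbreviated.</p>
<preformat>
```python
import re


def abbreviate(name):
    """'Escherichia coli' -> 'E. coli'; None if the name is not a binomial."""
    parts = name.split()
    if len(parts) != 2:
        return None
    genus, species = parts
    return f"{genus[0]}. {species}"


def find_abbreviated(text, full_names):
    """Return (start, end, full name) for each abbreviated mention in text."""
    hits = []
    for name in full_names:
        short = abbreviate(name)
        if short is None:
            continue
        pattern = r"\b" + re.escape(short) + r"\b"
        hits.extend((m.start(), m.end(), name) for m in re.finditer(pattern, text))
    return sorted(hits)
```
</preformat>
<p>Each hit keeps the full scientific name, so an abbreviated mention can be linked back to the dictionary entry that was detected in the first pass.</p>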
<p>A dictionary consists of a set of entries, each specified by the &lt;token&gt; XML tag. Each entry contains one or more variants (synonyms, common names). Taxonomic ranks (phylum, family, genus, and species) are optional. For example:</p>
<p>… phylum="basidiomycota" family="sirobasidiaceae" genus="sirobasidium"></p>
<p>… phylum="basidiomycota" family="sirobasidiaceae" genus="sirobasidium" species="brefeldianum"></p>
<p>Currently, the dictionaries of scientific names used in ChemEx are derived from the Integrated Taxonomic Information System (ITIS; 545,485 records, accessed on 9th December 2011) [
<xref ref-type="bibr" rid="B27">27</xref>
], the List of Prokaryotic names with Standing in Nomenclature (LPSN; 14,390 records, accessed on 8th December 2011) [
<xref ref-type="bibr" rid="B28">28</xref>
], and the Catalogue of Life (55,022 records from the fungi domain, accessed on 5th December 2011) [
<xref ref-type="bibr" rid="B29">29</xref>
]. For assays, drug-related terms from the Chemical Entities of Biological Interest (ChEBI) ontology (164 records, accessed on 13th February 2012) [
<xref ref-type="bibr" rid="B30">30</xref>
] were used.</p>
</sec>
<sec>
<title>Step 3: phrase parser and identification</title>
<p>ChemicalTagger [
<xref ref-type="bibr" rid="B24">24</xref>
] also parses sentences and identifies phrases. The phrase parser receives the tagged token stream and builds a grammatical structure based on predefined grammars. After the text is parsed, experimental action phrases, such as "Compound 1 was
<italic>added </italic>
to the solution" or "Compound 1 was
<italic>extracted </italic>
from compound 2", can be identified by analysing the grammatical structure. Numbers used for compound referencing (labels) are also identified in this step. Finally, ChemEx extracts natural products and their source organisms from ChemicalTagger's "yielded" phrases, such as "Compound 1 was
<italic>isolated </italic>
from the fungus
<italic>Xylaria multiplex</italic>
".</p>
</sec>
<sec>
<title>Step 4: coordination resolution</title>
<p>Sometimes, especially in an abstract, multiple compounds appear in one sentence, joined by punctuation marks or coordinating conjunctions. Resolving the meaning of such a noun group of compounds thus improves knowledge discovery. Coordination resolution identifies each compound in a compound chunk mentioned in the text. For example, "multiplolides A (1) and B (2)" consists of two compounds: multiplolide A, labeled as 1, and multiplolide B, labeled as 2. "drechslerines C-G (6-10)" consists of five compounds: drechslerine C (6), drechslerine D (7), drechslerine E (8), drechslerine F (9), and drechslerine G (10).</p>
<p>ChemEx uses a state machine (Figure
<xref ref-type="fig" rid="F5">5</xref>
) to recognize and interpret a compound group, taking labels and series into account. The state machine operates on the tagged token stream. The
<italic>Text </italic>
state disregards tokens that are not chemical entities. The
<italic>Chemical Name </italic>
state accumulates a chemical name consisting of one or more words. The
<italic>Series </italic>
and
<italic>Label </italic>
states handle series and label tokens, respectively. They also expand ranges by inserting the values between two letters or numbers. For instance, "A-C" becomes "A, B, C", and "1-3" becomes "1, 2, 3". The
<italic>And/To </italic>
state handles the "and" and "to" tokens. For instance, "compounds A and B" becomes "compound A, compound B", and "compounds A to C" becomes "compound A, compound B, compound C". In the end, individual chemical names with series and label are generated as chemical entities.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>State machine diagram for coordination resolution</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-5"></graphic>
</fig>
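<p>The effect of coordination resolution on the two examples given in this step can be illustrated with a short Python sketch. It is a simplified, regular-expression-based stand-in and not the state machine of Figure 5: it works on raw text rather than on a tagged token stream, handles only single-letter series and numeric labels, and singularises the shared name naively. The names <monospace>ITEM</monospace>, <monospace>expand_range</monospace>, and <monospace>resolve</monospace> are hypothetical.</p>
<preformat>
```python
import re

# One series item with its label, e.g. "A (1)" or "C-G (6-10)".
ITEM = re.compile(r"([A-Z](?:-[A-Z])?)\s*\((\d+(?:-\d+)?)\)")


def expand_range(token):
    """Expand 'C-G' to C..G and '6-10' to 6..10; return single values as is."""
    if "-" not in token:
        return [token]
    start, end = token.split("-", 1)
    if start.isdigit() and end.isdigit():
        return [str(n) for n in range(int(start), int(end) + 1)]
    return [chr(code) for code in range(ord(start), ord(end) + 1)]


def resolve(chunk):
    """Return (compound name, label) pairs for a coordinated compound chunk."""
    head, _, rest = chunk.partition(" ")
    # Naive singularisation of the shared name ("multiplolides" -> "multiplolide").
    stem = head[:-1] if head.endswith("s") else head
    pairs = []
    for series, label in ITEM.findall(rest):
        names = [f"{stem} {letter}" for letter in expand_range(series)]
        pairs.extend(zip(names, expand_range(label)))
    return pairs
```
</preformat>
<p>For "drechslerines C-G (6-10)" the sketch yields five name-label pairs, from ("drechslerine C", "6") to ("drechslerine G", "10"), and for "multiplolides A (1) and B (2)" it yields two.</p>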
</sec>
</sec>
<sec>
<title>Information viewer module</title>
<p>The information viewer provides a graphical interface for viewing the integrated results from all modules. ChemEx summarizes the natural products and their bioassay tests reported in a publication. The viewer includes the UIMA CAS Annotation Viewer [
<xref ref-type="bibr" rid="B22">22</xref>
] to display annotated text and JChemPaint [
<xref ref-type="bibr" rid="B31">31</xref>
] to reproduce structure thumbnails from MOL files generated by the
<italic>2D Chemical Structure Image Recognition </italic>
module. Additionally, the structure-label mapping tables generated by the
<italic>2D Chemical Structure Image Recognition </italic>
module are combined with the chemical compound entities extracted by the
<italic>Text Annotator </italic>
module as illustrated in Figure
<xref ref-type="fig" rid="F6">6</xref>
. Therefore, a chemical compound entity can be viewed and searched with its 2D chemical structure image and SMILES. A user can use the viewer to visualize results and export them to an XML file, which can be imported into sMOL Explorer [
<xref ref-type="bibr" rid="B32">32</xref>
], a web-enabled database and exploration tool for Small MOLecules datasets.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Structure-label mapping tables usage</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-6"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>Results and discussion</title>
<p>ChemEx extracts compound, organism, and assay entities from text content automatically. It also finds the 2D chemical structure of each compound in images embedded in the full text and converts these images to a machine-readable format. Results from ChemEx can be visualized through the information viewer as demonstrated in Figure
<xref ref-type="fig" rid="F7">7</xref>
. A user can view annotated text together with publication information, the compound list, the organisms that produce those compounds, and assay tests. Each compound can also be searched for in external databases [
<xref ref-type="bibr" rid="B2">2</xref>
,
<xref ref-type="bibr" rid="B30">30</xref>
] and edited with a 2D chemical structure editor (Figure
<xref ref-type="fig" rid="F8">8</xref>
). Moreover, a user can view and export extracted information of all publications in a collection in one place (Figure
<xref ref-type="fig" rid="F9">9</xref>
).</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Example of an information viewer for one document</bold>
. This main screen displays extracted information from a publication. The user can step through a collection via control buttons and export the collection to one single XML file. Structure information can be investigated further by clicking a 2D chemical structure image.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-7"></graphic>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>Example of structure information from external databases</bold>
. This screen displays a structure name, SMILES, and path to the structure file extracted from a publication. The user can edit the structure file using JChemPaint. Furthermore, ChemEx uses extracted information to search through external databases via web services. The user can view the retrieved information from PubChem and ChEBI.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-8"></graphic>
</fig>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption>
<p>
<bold>Example of an information viewer for a collection</bold>
.</p>
</caption>
<graphic xlink:href="1471-2105-13-S17-S9-9"></graphic>
</fig>
<p>The system was tested on literature from ACS Publications (accessed on 13th March 2012) [
<xref ref-type="bibr" rid="B33">33</xref>
]. The keywords used for literature retrieval were "fungus Thailand". All accessible research articles with an abstract and a full-text PDF were downloaded. In total, 89 publications were obtained, but the test set contained only the 74 publications that report compounds with 2D chemical structures.</p>
<p>Each full text was retrieved with its bibliography. All images, including but not limited to 2D chemical structure images, were extracted from each PDF. The accuracy of information extraction and of 2D chemical structure image recognition was evaluated.</p>
<sec>
<title>Information extraction evaluation</title>
<p>The information extracted from text content, consisting of compounds, organisms, and assays, was listed by the system and compared with manually listed entities. The results are shown in Table
<xref ref-type="table" rid="T1">1</xref>
. Note that entities were evaluated regardless of whether they were natural products.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Extracted information from text content of the test set</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="left">Exact Matches</th>
<th align="left">Partial Matches</th>
<th align="left">False Positive</th>
<th align="left">False Negative</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Compounds</td>
<td align="left">203</td>
<td align="left">15</td>
<td align="left">41</td>
<td align="left">105</td>
<td align="left">83.20%</td>
<td align="left">62.85%</td>
</tr>
<tr>
<td align="left">Organisms</td>
<td align="left">91</td>
<td align="left">21</td>
<td align="left">3</td>
<td align="left">5</td>
<td align="left">96.81%</td>
<td align="left">77.78%</td>
</tr>
<tr>
<td align="left">Assays</td>
<td align="left">80</td>
<td align="left">0</td>
<td align="left">0</td>
<td align="left">15</td>
<td align="left">100.00%</td>
<td align="left">84.21%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The test set consisted of 89 publications with terms "fungus Thailand" from ACS Publications. Only 74 publications reported compounds with 2D chemical structures. Compounds, organisms, and assays were extracted from text content and compared with manually listed entities.</p>
</table-wrap-foot>
</table-wrap>
<p>An exact match was an extracted entity matching the whole term of a manually listed entity, whereas a partial match was an extracted entity matching only part of a manually listed entity. A false positive (FP) was an unexpected result, and a false negative (FN) was a missing result. An exact match was defined as a true positive (TP). By default, a partial match was classified as a false negative. Precision and recall were defined as:</p>
<p>
<disp-formula>
<mml:math id="M1" name="1471-2105-13-S17-S9-i1" overflow="scroll">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">Precision</mml:mtext>
</mml:mstyle>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">TP</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo class="MathClass-punc">,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula>
<mml:math id="M2" name="1471-2105-13-S17-S9-i2" overflow="scroll">
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">Recall</mml:mtext>
</mml:mstyle>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle class="text">
<mml:mtext class="textsf" mathvariant="sans-serif">TP</mml:mtext>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mi>.</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
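<p>As a worked check, the following Python sketch applies these definitions to the counts in Table 1. The function name and its parameters are illustrative. It assumes that, when partial matches are classified as false negatives, they are added to the recall denominator on top of the False Negative column; the published percentages appear to be reproduced only under that reading.</p>
<preformat>
```python
def precision_recall(exact, partial, fp, fn, partial_as_tp=False):
    """Return (precision, recall) in percent, rounded to two decimals.

    By default a partial match counts as a false negative; with
    partial_as_tp=True it counts as a true positive instead.
    """
    tp = exact + partial if partial_as_tp else exact
    missed = fn if partial_as_tp else fn + partial
    precision = tp / (tp + fp)
    recall = tp / (tp + missed)
    return round(100 * precision, 2), round(100 * recall, 2)


compounds = precision_recall(exact=203, partial=15, fp=41, fn=105)
organisms = precision_recall(exact=91, partial=21, fp=3, fn=5)
assays = precision_recall(exact=80, partial=0, fp=0, fn=15)
```
</preformat>
<p>With the Table 1 counts this gives (83.2, 62.85) for compounds, (96.81, 77.78) for organisms, and (100.0, 84.21) for assays. Calling the function for organisms with <monospace>partial_as_tp=True</monospace> gives (97.39, 95.73).</p>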
<sec>
<title>Compound entities</title>
<p>The main purpose of the experiment was to extract compound entities from the content of abstracts and to discover their 2D depictions in images embedded in the full text. Thus, only compounds that have 2D structure images were considered. Partial matches were considered mismatches.</p>
<p>The system extracted compound entities with 83.20% precision and 62.85% recall, whereas ChemicalTagger alone achieved 61.34% precision and 22.60% recall. As demonstrated in Table
<xref ref-type="table" rid="T2">2</xref>
, ChemEx increases precision and recall by 21.85 and 40.25 percentage points, respectively.</p>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>Extracted chemical entities from text content of the test set compared to ChemicalTagger</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="left">Exact Matches</th>
<th align="left">Partial Matches</th>
<th align="left">False Positive</th>
<th align="left">False Negative</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">ChemicalTagger</td>
<td align="left">73</td>
<td align="left">54</td>
<td align="left">46</td>
<td align="left">196</td>
<td align="left">61.34%</td>
<td align="left">22.60%</td>
</tr>
<tr>
<td align="left">ChemEx</td>
<td align="left">203</td>
<td align="left">15</td>
<td align="left">41</td>
<td align="left">105</td>
<td align="left">83.20%</td>
<td align="left">62.85%</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td align="left">+21.85%</td>
<td align="left">+40.25%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The main improvement resulted from ChemEx's
<italic>Coordination Resolution</italic>
, which, for example, recognized five compounds instead of one from "drechslerines C-G". The improvement could be smaller for full text, where each compound is written out separately.</p>
<p>Currently,
<italic>Coordination Resolution </italic>
recognizes only labels that consist purely of Arabic digits (e.g., 0, 1, 2). Future development could extend coordination resolution to recognize other types of label, such as Roman numerals or letters.</p>
</sec>
<sec>
<title>Organism entities</title>
<p>Organism recognition performed well, with 96.81% precision and 77.78% recall. False negatives were scientific names outside the domain of interest. Partial matches were scientific names for which only the genus was detected, without the species. If partial matches are counted as true positives, performance rises to 97.39% precision and 95.73% recall.</p>
<p>While dictionary-based text mining yields high precision, its recall may be low depending on dictionary size. However, large dictionaries increase processing time and memory usage. It is recommended to supply dictionaries according to the domain of interest.</p>
</sec>
<sec>
<title>Assay entities</title>
<p>Assay recognition achieved 100.00% precision and 84.21% recall. False negatives were due to assay terms that do not exist in the corpus.</p>
<p>Currently, ChemEx recognizes only one-word assays, such as "antifungal" or "cytotoxic". However, some assays are reported as a sentence, for example, "Compound 1 inhibited activity against the malarial parasite
<italic>Plasmodium falciparum</italic>
". Future development could apply phrase parsing to recognize these assay phrases.</p>
</sec>
</sec>
<sec>
<title>2D chemical structure image recognition evaluation</title>
<p>Compound entities in publications were manually listed and used to search for the corresponding chemical structures in PubChem [
<xref ref-type="bibr" rid="B2">2</xref>
]. The search was done automatically via web service, and the most similar structure for each compound was used in the evaluation. In total, 204 structures were found and downloaded as the ground truth. Then a CACTVS script [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B34">34</xref>
] evaluated the structural similarity between ground truth and regenerated structures based on standard InChI [
<xref ref-type="bibr" rid="B35">35</xref>
].</p>
<p>ChemEx was able to map 144 structures (70.59%) to compound entities extracted from text content. Mapping errors come from imperfect image segmentation, OCR errors, and incomplete patterns in label recognition. Table
<xref ref-type="table" rid="T3">3</xref>
shows the number of structures by similarity score. "T > 70%" indicates the number of structures with similarity above 70%. There were 72 structures (35.29%) with a similarity score above 70%, and their average similarity was 91.42%. ChemEx reconstructed 28 identical structures (13.73%). The average similarity between ground truth and regenerated structures was 71.86%.</p>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>Results of 2D chemical structure image recognition on the test set</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Thresholds</th>
<th align="left">Number of Structures (% of the total)</th>
<th align="left">Average Similarity Score (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Cannot find the InChI</td>
<td align="left">9 (4.41%)</td>
<td align="left">-</td>
</tr>
<tr>
<td align="left">T > 70%</td>
<td align="left">72 (35.29%)</td>
<td align="left">91.42</td>
</tr>
<tr>
<td align="left">T > 80%</td>
<td align="left">61 (29.90%)</td>
<td align="left">94.43</td>
</tr>
<tr>
<td align="left">T > 90%</td>
<td align="left">44 (21.57%)</td>
<td align="left">98.30</td>
</tr>
<tr>
<td align="left">Identical structure</td>
<td align="left">28 (13.73%)</td>
<td align="left">100.00</td>
</tr>
<tr>
<td align="left">Total mapped structures</td>
<td align="left">144 (70.59%)</td>
<td align="left">71.86</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>CACTVS script computed structure similarity between ground truth and regenerated structures based on standard InChI. In total 204 structures from PubChem were downloaded as the ground truth.</p>
</table-wrap-foot>
</table-wrap>
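<p>The percentages in Table 3 are shares of the 204 ground-truth structures, not of the 144 mapped structures. The short Python check below recomputes them from the counts in the table; the dictionary keys and the helper name <monospace>share</monospace> are illustrative.</p>
<preformat>
```python
TOTAL_GROUND_TRUTH = 204

# Counts copied from Table 3.
counts = {
    "no InChI": 9,
    "T > 70%": 72,
    "T > 80%": 61,
    "T > 90%": 44,
    "identical": 28,
    "mapped": 144,
}


def share(count, total=TOTAL_GROUND_TRUTH):
    """Percentage of the ground-truth structures, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {name: share(n) for name, n in counts.items()}
```
</preformat>
<p>The threshold rows are cumulative: the 72 structures above 70% include the 61 above 80%, which in turn include the 44 above 90% and the 28 identical structures.</p>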
<p>Our experiment found that OSRA [
<xref ref-type="bibr" rid="B11">11</xref>
] sometimes recognized a graph as a chemical structure. Image classification prior to
<italic>2D Chemical Structure Image Recognition </italic>
could improve accuracy and performance. Another major issue is that OSRA is concerned only with structure images. Retrieving non-structure image components from OSRA may result in high segmentation error, which causes errors in structure-label mapping. Future development could apply segment categorization [
<xref ref-type="bibr" rid="B16">16</xref>
] before using OSRA to address this issue.</p>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusions</title>
<p>ChemEx automatically discovers chemical knowledge from a large collection of publications. It is built on top of multiple pieces of software [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B22">22</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
], allowing information extraction from both visual and textual content. The system extracts compound, organism, and assay information within a flexible framework. A user can add new dictionaries to customize results according to the domain of interest. The ChemEx information viewer integrates and visualizes results. To the best of our knowledge, ChemEx is the first system that provides these functionalities. Although its accuracy needs to be improved, ChemEx helps the user understand the extracted information and assists in the chemical data curation process. We believe it is one step towards fully automatic chemical data curation, which is useful for constructing large chemical structure libraries.</p>
</sec>
<sec>
<title>Availability and requirements</title>
<p>• Project name: ChemEx - Chemical Information Extraction.</p>
<p>• Project home page:
<ext-link ext-link-type="uri" xlink:href="http://www.biotec.or.th/isl/ChemEx">http://www.biotec.or.th/isl/ChemEx</ext-link>
.</p>
<p>• Operating system(s): Windows and Linux.</p>
<p>• Programming language: Java and C++.</p>
<p>• Other requirements: at least 2 GB of RAM. Other dependencies are listed on the home page.</p>
<p>• License: GNU GPL.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>AT developed the system and drafted the manuscript. SN was responsible for building the project homepage and software installer. SI and DW designed and supervised the project and refined the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>This work was supported by the National Center for Genetic Engineering and Biotechnology (BIOTEC).</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics </italic>
Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17">http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="other">
<article-title>ChemBank</article-title>
<ext-link ext-link-type="uri" xlink:href="http://chembank.broadinstitute.org/">http://chembank.broadinstitute.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Bolton</surname>
<given-names>Evan E</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Yanli</given-names>
</name>
<name>
<surname>Thiessen</surname>
<given-names>Paul A</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>Stephen H</given-names>
</name>
<article-title>PubChem: integrated platform of small molecules and biological activities</article-title>
<source>Annual Reports in Computational Chemistry</source>
<year>2008</year>
<volume>4</volume>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Hanisch</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Fundel</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Mevissen</surname>
<given-names>H-T</given-names>
</name>
<name>
<surname>Zimmer</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Fluck</surname>
<given-names>J</given-names>
</name>
<article-title>ProMiner: rule-based protein and gene entity recognition</article-title>
<source>BMC Bioinformatics</source>
<year>2005</year>
<volume>6</volume>
<issue>Suppl 1</issue>
<fpage>S14</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-6-S1-S14</pub-id>
<pub-id pub-id-type="pmid">15960826</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Cohen</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Hersh</surname>
<given-names>WR</given-names>
</name>
<article-title>A survey of current work in biomedical text mining</article-title>
<source>Briefings in Bioinformatics</source>
<year>2005</year>
<volume>6</volume>
<fpage>57</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="doi">10.1093/bib/6.1.57</pub-id>
<pub-id pub-id-type="pmid">15826357</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Krallinger</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Leitner</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Rodriguez-Penagos</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Valencia</surname>
<given-names>A</given-names>
</name>
<article-title>Overview of the protein-protein interaction annotation extraction task of BioCreative II</article-title>
<source>Genome Biology</source>
<year>2008</year>
<volume>9</volume>
<fpage>S4</fpage>
<pub-id pub-id-type="pmid">18834495</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="other">
<article-title>GENIA tagger</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.nactem.ac.uk/tsujii/GENIA/tagger/">http://www.nactem.ac.uk/tsujii/GENIA/tagger/</ext-link>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>McDaniel</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Balmuth</surname>
<given-names>JR</given-names>
</name>
<article-title>Kekule: OCR-optical chemical (structure) recognition</article-title>
<source>Journal of Chemical Information and Computer Sciences</source>
<year>1992</year>
<volume>32</volume>
<fpage>373</fpage>
<lpage>378</lpage>
<pub-id pub-id-type="doi">10.1021/ci00008a018</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Ibison</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Jacquot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kam</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Neville</surname>
<given-names>AG</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>RW</given-names>
</name>
<name>
<surname>Tonnelier</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Venczel</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>AP</given-names>
</name>
<article-title>Chemical literature data extraction: The CLiDE Project</article-title>
<source>Journal of Chemical Information and Computer Sciences</source>
<year>1993</year>
<volume>33</volume>
<fpage>338</fpage>
<lpage>344</lpage>
<pub-id pub-id-type="doi">10.1021/ci00013a010</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Valko</surname>
<given-names>AT</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>AP</given-names>
</name>
<article-title>CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition</article-title>
<source>Journal of Chemical Information and Modeling</source>
<year>2009</year>
<volume>49</volume>
<fpage>780</fpage>
<lpage>787</lpage>
<pub-id pub-id-type="doi">10.1021/ci800449t</pub-id>
<pub-id pub-id-type="pmid">19298076</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="other">
<name>
<surname>Algorri</surname>
<given-names>M-E</given-names>
</name>
<name>
<surname>Zimmermann</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Friedrich</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Akle</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hofmann-Apitius</surname>
<given-names>M</given-names>
</name>
<article-title>Reconstruction of chemical molecules from images</article-title>
<source>29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2007. EMBS 2007. IEEE</source>
<year>2007</year>
<fpage>4609</fpage>
<lpage>4612</lpage>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<name>
<surname>Filippov</surname>
<given-names>IV</given-names>
</name>
<name>
<surname>Nicklaus</surname>
<given-names>MC</given-names>
</name>
<article-title>Optical structure recognition software to recover chemical information: OSRA, an open source solution</article-title>
<source>Journal of Chemical Information and Modeling</source>
<year>2009</year>
<volume>49</volume>
<fpage>740</fpage>
<lpage>743</lpage>
<pub-id pub-id-type="doi">10.1021/ci800067r</pub-id>
<pub-id pub-id-type="pmid">19434905</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="other">
<name>
<surname>Park</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Rosania</surname>
<given-names>GR</given-names>
</name>
<name>
<surname>Shedden</surname>
<given-names>KA</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lyu</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Saitou</surname>
<given-names>K</given-names>
</name>
<article-title>Automated extraction of chemical structure information from digital raster images</article-title>
<source>Chem Cent J</source>
<volume>3</volume>
<fpage>4</fpage>
<lpage>4</lpage>
<pub-id pub-id-type="pmid">19196483</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Klinger</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kolářik</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Fluck</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hofmann-Apitius</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Friedrich</surname>
<given-names>CM</given-names>
</name>
<article-title>Detection of IUPAC and IUPAC-like chemical names</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>i268</fpage>
<lpage>i276</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn181</pub-id>
<pub-id pub-id-type="pmid">18586724</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="book">
<name>
<surname>Sun</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Mitra</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Giles</surname>
<given-names>CL</given-names>
</name>
<article-title>Extraction and search of chemical formulae in text documents on the web</article-title>
<source>Proceedings of the 16th international conference on World Wide Web</source>
<year>2007</year>
<publisher-name>New York, NY, USA: ACM</publisher-name>
<fpage>251</fpage>
<lpage>260</lpage>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="journal">
<name>
<surname>Hamon</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Grabar</surname>
<given-names>N</given-names>
</name>
<article-title>Linguistic approach for identification of medication names and related information in clinical narratives</article-title>
<source>Journal of the American Medical Informatics Association</source>
<year>2010</year>
<volume>17</volume>
<fpage>549</fpage>
<lpage>554</lpage>
<pub-id pub-id-type="doi">10.1136/jamia.2010.004036</pub-id>
<pub-id pub-id-type="pmid">20819862</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="book">
<name>
<surname>Yan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Spangler</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<person-group person-group-type="editor">Burgard W, Roth D</person-group>
<article-title>Cross media entity extraction and linkage for chemical documents</article-title>
<source>AAAI</source>
<year>2011</year>
<publisher-name>AAAI Press</publisher-name>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Newman</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Cragg</surname>
<given-names>GM</given-names>
</name>
<article-title>Natural products as sources of new drugs over the last 25 years</article-title>
<source>Journal of Natural Products</source>
<year>2007</year>
<volume>70</volume>
<fpage>461</fpage>
<lpage>477</lpage>
<pub-id pub-id-type="doi">10.1021/np068054v</pub-id>
<pub-id pub-id-type="pmid">17309302</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="other">
<article-title>Poppler - PDF rendering library</article-title>
<ext-link ext-link-type="uri" xlink:href="http://poppler.freedesktop.org/">http://poppler.freedesktop.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="other">
<article-title>Simplified molecular-input line-entry system</article-title>
<ext-link ext-link-type="uri" xlink:href="http://en.wikipedia.org/wiki/SMILES">http://en.wikipedia.org/wiki/SMILES</ext-link>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="other">
<article-title>Chemical table file</article-title>
<ext-link ext-link-type="uri" xlink:href="http://en.wikipedia.org/wiki/Chemical_table_file">http://en.wikipedia.org/wiki/Chemical_table_file</ext-link>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="other">
<article-title>GOCR: open-source character recognition</article-title>
<ext-link ext-link-type="uri" xlink:href="http://jocr.sourceforge.net/">http://jocr.sourceforge.net/</ext-link>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<article-title>Apache UIMA - Unstructured Information Management applications</article-title>
<ext-link ext-link-type="uri" xlink:href="http://uima.apache.org/">http://uima.apache.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Jessop</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Adams</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Willighagen</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Hawizy</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Murray-Rust</surname>
<given-names>P</given-names>
</name>
<article-title>OSCAR4: a flexible architecture for chemical text-mining</article-title>
<source>Journal of Cheminformatics</source>
<year>2011</year>
<volume>3</volume>
<fpage>41</fpage>
<pub-id pub-id-type="doi">10.1186/1758-2946-3-41</pub-id>
<pub-id pub-id-type="pmid">21999457</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Hawizy</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Jessop</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Adams</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Murray-Rust</surname>
<given-names>P</given-names>
</name>
<article-title>ChemicalTagger: A tool for semantic text-mining in chemistry</article-title>
<source>Journal of Cheminformatics</source>
<year>2011</year>
<volume>3</volume>
<fpage>17</fpage>
<pub-id pub-id-type="doi">10.1186/1758-2946-3-17</pub-id>
<pub-id pub-id-type="pmid">21575201</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Corbett</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Copestake</surname>
<given-names>A</given-names>
</name>
<article-title>Cascaded classifiers for confidence-based chemical named entity recognition</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>S4</fpage>
<pub-id pub-id-type="pmid">19025690</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="other">
<article-title>Apache UIMA ConceptMapper Annotator Documentation</article-title>
<ext-link ext-link-type="uri" xlink:href="http://uima.apache.org/d/uima-addons-current/ConceptMapper/ConceptMapperAnnotatorUserGuide.html">http://uima.apache.org/d/uima-addons-current/ConceptMapper/ConceptMapperAnnotatorUserGuide.html</ext-link>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="other">
<article-title>Integrated Taxonomic Information System</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.itis.gov/">http://www.itis.gov/</ext-link>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="other">
<article-title>List of Prokaryotic names with Standing in Nomenclature (LPSN)</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.bacterio.cict.fr/">http://www.bacterio.cict.fr/</ext-link>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="other">
<article-title>Catalogue of Life</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.catalogueoflife.org/">http://www.catalogueoflife.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Degtyarenko</surname>
<given-names>K</given-names>
</name>
<name>
<surname>de Matos</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Ennis</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hastings</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zbinden</surname>
<given-names>M</given-names>
</name>
<name>
<surname>McNaught</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Alcantara</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Darsow</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Guedj</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<article-title>ChEBI: a database and ontology for chemical entities of biological interest</article-title>
<source>Nucleic Acids Research</source>
<year>2008</year>
<volume>36</volume>
<fpage>D344</fpage>
<lpage>D350</lpage>
<pub-id pub-id-type="pmid">17932057</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="other">
<article-title>JChemPaint</article-title>
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/apps/mediawiki/cdk/index.php?title=JChemPaint">http://sourceforge.net/apps/mediawiki/cdk/index.php?title=JChemPaint</ext-link>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Ingsriswang</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Pacharawongsakda</surname>
<given-names>E</given-names>
</name>
<article-title>sMOL Explorer: an open source, web-enabled database and exploration tool for small MOLecules datasets</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>2498</fpage>
<lpage>2500</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm363</pub-id>
<pub-id pub-id-type="pmid">17660205</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="other">
<article-title>ACS Publications</article-title>
<ext-link ext-link-type="uri" xlink:href="http://pubs.acs.org/">http://pubs.acs.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="other">
<article-title>CACTVS Chemoinformatics Toolkit Academic</article-title>
<ext-link ext-link-type="uri" xlink:href="http://xemistry.com/">http://xemistry.com/</ext-link>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="other">
<article-title>IUPAC - International Union of Pure and Applied Chemistry: The IUPAC International Chemical Identifier (InChI)</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.iupac.org/home/publications/e-resources/inchi.html">http://www.iupac.org/home/publications/e-resources/inchi.html</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000098 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000098 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:3521388
   |texte=   ChemEx: information extraction system for chemical data curation
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:23282330" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024