Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents
Internal identifier: 000242 (Ncbi/Curation); previous: 000241; next: 000243
Authors: Stefan Senger; Luca Bartek; George Papadatos; Anna Gaulton
Source:
- Journal of Cheminformatics [ 1758-2946 ] ; 2015.
Abstract
First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.
Using SciFinder as our reference for the chemical structures extracted from a set of patents, 59 % and 51 % of these structures were also found in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as the starting point, i.e. establishing whether, for a list of compounds, the databases provide the links between chemical structures and the patents they appear in, we obtained similar results: SureChEMBL and IBM SIIP found 62 % and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.
In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant ‘gold standards’ is required.
The online version of this article (doi:10.1186/s13321-015-0097-z) contains supplementary material, which is available to authorized users.
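The percentages quoted in the abstract are coverage figures: the share of reference links (from SciFinder or Reaxys) that an automatically generated database also contains. The following is a minimal sketch of that calculation, not the authors' actual workflow. The function name `coverage`, the compound and patent identifiers, and the toy data are all hypothetical, and the abstract does not say how structures were matched, so plain string keys stand in for whatever structure identifier is used in practice.

```python
def coverage(reference_pairs, candidate_pairs):
    """Return the percentage of reference (compound, patent) pairs
    that are also present in the candidate database."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("reference set is empty")
    found = reference & set(candidate_pairs)
    return 100.0 * len(found) / len(reference)


# Hypothetical toy data: pairs from a manually curated reference source ...
reference = {
    ("cmpd-A", "US-1"),
    ("cmpd-B", "US-1"),
    ("cmpd-C", "US-2"),
    ("cmpd-D", "US-3"),
    ("cmpd-E", "US-3"),
}
# ... and pairs from an automatically extracted database.
# Pairs that are absent from the reference do not affect the result.
extracted = {
    ("cmpd-A", "US-1"),
    ("cmpd-C", "US-2"),
    ("cmpd-E", "US-3"),
    ("cmpd-Z", "US-9"),
}

print(f"{coverage(reference, extracted):.0f} % of reference pairs found")
```

Note that this measures only how much of the reference is recovered. It says nothing about extra or incorrect links in the automatically generated database.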
DOI: 10.1186/s13321-015-0097-z
PubMed: 26457120
PubMed Central: 4594083
Links to previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 000038
- to stream Pmc, to step Curation: 000038
- to stream Pmc, to step Checkpoint: 000019
- to stream Ncbi, to step Merge: 000242
Links to Exploration step
PMC:4594083Curation
No country items
Stefan Senger<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents</title>
<author><name sortKey="Senger, Stefan" sort="Senger, Stefan" uniqKey="Senger S" first="Stefan" last="Senger">Stefan Senger</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Bartek, Luca" sort="Bartek, Luca" uniqKey="Bartek L" first="Luca" last="Bartek">Luca Bartek</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Papadatos, George" sort="Papadatos, George" uniqKey="Papadatos G" first="George" last="Papadatos">George Papadatos</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Gaulton, Anna" sort="Gaulton, Anna" uniqKey="Gaulton A" first="Anna" last="Gaulton">Anna Gaulton</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26457120</idno>
<idno type="pmc">4594083</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4594083</idno>
<idno type="RBID">PMC:4594083</idno>
<idno type="doi">10.1186/s13321-015-0097-z</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000038</idno>
<idno type="wicri:Area/Pmc/Curation">000038</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000019</idno>
<idno type="wicri:Area/Ncbi/Merge">000242</idno>
<idno type="wicri:Area/Ncbi/Curation">000242</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents</title>
<author><name sortKey="Senger, Stefan" sort="Senger, Stefan" uniqKey="Senger S" first="Stefan" last="Senger">Stefan Senger</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Bartek, Luca" sort="Bartek, Luca" uniqKey="Bartek L" first="Luca" last="Bartek">Luca Bartek</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Papadatos, George" sort="Papadatos, George" uniqKey="Papadatos G" first="George" last="Papadatos">George Papadatos</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Gaulton, Anna" sort="Gaulton, Anna" uniqKey="Gaulton A" first="Anna" last="Gaulton">Anna Gaulton</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
</analytic>
<series><title level="j">Journal of Cheminformatics</title>
<idno type="eISSN">1758-2946</idno>
<imprint><date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.</p>
</sec>
<sec><title>Results</title>
<p>Using SciFinder as our reference for the chemical structures extracted from a set of patents, 59 % and 51 % of these structures were also found in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as the starting point, i.e. establishing whether, for a list of compounds, the databases provide the links between chemical structures and the patents they appear in, we obtained similar results: SureChEMBL and IBM SIIP found 62 % and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.</p>
</sec>
<sec><title>Conclusions</title>
<p>In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant ‘gold standards’ is required.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13321-015-0097-z) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Bregonje, M" uniqKey="Bregonje M">M Bregonje</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Chambers, J" uniqKey="Chambers J">J Chambers</name>
</author>
<author><name sortKey="Davies, M" uniqKey="Davies M">M Davies</name>
</author>
<author><name sortKey="Gaulton, A" uniqKey="Gaulton A">A Gaulton</name>
</author>
<author><name sortKey="Hersey, A" uniqKey="Hersey A">A Hersey</name>
</author>
<author><name sortKey="Velankar, S" uniqKey="Velankar S">S Velankar</name>
</author>
<author><name sortKey="Petryszak, R" uniqKey="Petryszak R">R Petryszak</name>
</author>
<author><name sortKey="Hastings, J" uniqKey="Hastings J">J Hastings</name>
</author>
<author><name sortKey="Bellis, L" uniqKey="Bellis L">L Bellis</name>
</author>
<author><name sortKey="Mcglinchey, S" uniqKey="Mcglinchey S">S McGlinchey</name>
</author>
<author><name sortKey="Overington, Jp" uniqKey="Overington J">JP Overington</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Akhondi, Sa" uniqKey="Akhondi S">SA Akhondi</name>
</author>
<author><name sortKey="Klenner, Ag" uniqKey="Klenner A">AG Klenner</name>
</author>
<author><name sortKey="Tyrchan, C" uniqKey="Tyrchan C">C Tyrchan</name>
</author>
<author><name sortKey="Manchala, Ak" uniqKey="Manchala A">AK Manchala</name>
</author>
<author><name sortKey="Boppana, K" uniqKey="Boppana K">K Boppana</name>
</author>
<author><name sortKey="Lowe, D" uniqKey="Lowe D">D Lowe</name>
</author>
<author><name sortKey="Zimmermann, M" uniqKey="Zimmermann M">M Zimmermann</name>
</author>
<author><name sortKey="Jagarlapudi, Sarp" uniqKey="Jagarlapudi S">SARP Jagarlapudi</name>
</author>
<author><name sortKey="Sayle, R" uniqKey="Sayle R">R Sayle</name>
</author>
<author><name sortKey="Kors, Ja" uniqKey="Kors J">JA Kors</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Heller, S" uniqKey="Heller S">S Heller</name>
</author>
<author><name sortKey="Mcnaught, A" uniqKey="Mcnaught A">A McNaught</name>
</author>
<author><name sortKey="Pletnev, I" uniqKey="Pletnev I">I Pletnev</name>
</author>
<author><name sortKey="Stein, S" uniqKey="Stein S">S Stein</name>
</author>
<author><name sortKey="Tchekhovskoi, D" uniqKey="Tchekhovskoi D">D Tchekhovskoi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Southan, C" uniqKey="Southan C">C Southan</name>
</author>
<author><name sortKey="Varkonyi, P" uniqKey="Varkonyi P">P Varkonyi</name>
</author>
<author><name sortKey="Boppana, K" uniqKey="Boppana K">K Boppana</name>
</author>
<author><name sortKey="Jagarlapudi, Sarp" uniqKey="Jagarlapudi S">SARP Jagarlapudi</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Hattori, K" uniqKey="Hattori K">K Hattori</name>
</author>
<author><name sortKey="Wakabayashi, H" uniqKey="Wakabayashi H">H Wakabayashi</name>
</author>
<author><name sortKey="Tamaki, K" uniqKey="Tamaki K">K Tamaki</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tyrchan, C" uniqKey="Tyrchan C">C Tyrchan</name>
</author>
<author><name sortKey="Bostrom, J" uniqKey="Bostrom J">J Boström</name>
</author>
<author><name sortKey="Giordanetto, F" uniqKey="Giordanetto F">F Giordanetto</name>
</author>
<author><name sortKey="Winter, J" uniqKey="Winter J">J Winter</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
</record>
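The record above uses the `nlm:` and `wicri:` prefixes without declaring them in this excerpt, so a strict XML parser will reject it as written. The sketch below shows one possible workaround with Python's standard `xml.etree.ElementTree`: wrap the record in an element that declares the two prefixes, then read the title, authors and identifiers. The namespace URIs, the `wrapper` element and the `parse_record` helper are assumptions made for illustration only. The embedded record is abbreviated to one author and two identifiers.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the record above (one author, two identifiers).
RECORD = """<record><TEI><teiHeader><fileDesc>
<titleStmt><title xml:lang="en">Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents</title>
<author><name sortKey="Senger, Stefan" first="Stefan" last="Senger">Stefan Senger</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff></affiliation>
</author></titleStmt>
<publicationStmt><idno type="pmid">26457120</idno><idno type="doi">10.1186/s13321-015-0097-z</idno></publicationStmt>
</fileDesc></teiHeader></TEI></record>"""

# Placeholder namespace URIs (assumptions): the excerpt does not declare
# the nlm: and wicri: prefixes, so they are bound here only to let the
# parser accept the fragment.
WRAPPER = (
    '<wrapper xmlns:nlm="urn:example:nlm" '
    'xmlns:wicri="urn:example:wicri">{}</wrapper>'
)


def parse_record(xml_text):
    """Extract title, author names and identifiers from a record string."""
    root = ET.fromstring(WRAPPER.format(xml_text))
    title = root.findtext(".//titleStmt/title")
    authors = [name.text for name in root.findall(".//titleStmt/author/name")]
    ids = {idno.get("type"): idno.text
           for idno in root.findall(".//publicationStmt/idno")}
    return {"title": title, "authors": authors, "ids": ids}


info = parse_record(RECORD)
print(info["authors"], info["ids"]["doi"])
```

Restricting the paths to `titleStmt` avoids picking up the second copy of the title and authors that the full record repeats under `sourceDesc`.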
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Ncbi/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000242 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Curation/biblio.hfd -nk 000242 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Ncbi |étape= Curation |type= RBID |clé= PMC:4594083 |texte= Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Curation/RBID.i -Sk "pubmed:26457120" \
| HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Curation/biblio.hfd \
| NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.