Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction

Internal identifier: 000102 (PascalFrancis/Corpus); previous: 000101; next: 000103

Authors: Roger Sayle; Paul Hongxing Xie; Sorel Muresan

Source :

Journal of chemical information and modeling [ 1549-9596 ] ; 2012.

RBID : Pascal:12-0102438

French descriptors

Pascal (Inist)
- Fouille donnée, Texte, Bioinformatique, Ontologie, Reconnaissance caractère, Reconnaissance optique caractère, Rupture, Brevet, Propriété industrielle, Dictionnaire automatique, Industrie pharmaceutique, Nomenclature, Gène, Homme, Typographie, Erreur humaine, Correction automatique, Trait union.

English descriptors

KwdEn :
- Automatic correction, Automatic dictionary, Bioinformatics, Character recognition, Data mining, Gene, Human, Human error, Hyphen, Nomenclature, Ontology, Optical character recognition, Patent rights, Patents, Pharmaceutical industry, Rupture, Text, Typography.

Abstract

The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
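
The abstract does not spell out how CaffeineFix works internally, so the following is only a minimal Python sketch of the general idea it describes: repair line-break hyphenation, then match tokens against known names while tolerating spelling errors. The vocabulary `KNOWN_NAMES`, the helper names, and the 0.8 similarity cutoff are all assumptions made for illustration, and the fuzzy matching relies on the standard-library `difflib` module rather than on the authors' technique. A fixed word list like this one cannot cover systematic names, which is exactly the limitation the abstract points out.

```python
import difflib
import re
from typing import List, Optional, Tuple

# Hypothetical mini-vocabulary (assumption): a finite list only works for
# trivial names, not for open-ended systematic nomenclature.
KNOWN_NAMES = ["caffeine", "ibuprofen", "paracetamol"]


def rejoin_hyphenated(text: str) -> str:
    """Join words split by a hyphen at a line break ('ibupro-\\nfen' -> 'ibuprofen')."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def correct_token(token: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def find_chemical_names(text: str) -> List[Tuple[str, str]]:
    """Return (token as written, corrected name) pairs found in free text."""
    cleaned = rejoin_hyphenated(text)
    found = []
    for token in re.findall(r"[A-Za-z]+", cleaned):
        corrected = correct_token(token)
        if corrected is not None:
            found.append((token, corrected))
    return found


if __name__ == "__main__":
    sample = "The tablet contains caffiene and ibupro-\nfen."
    print(find_chemical_names(sample))
```

In a pipeline of the kind the abstract outlines, the corrected names would then be passed to whichever name-to-structure software is in use.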

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

A01	`01`	`1`		`@0 1549-9596`
A03		`1`		`@0 J. chem. inf. model.`
A05				`@2 52`
A06				`@2 1`
A08	`01`	`1`	`ENG`	`@1 Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction`
A11	`01`	`1`		`@1 SAYLE (Roger)`
A11	`02`	`1`		`@1 HONGXING XIE (Paul)`
A11	`03`	`1`		`@1 MURESAN (Sorel)`
A14	`01`			`@1 NextMove Software @2 Cambridge @3 GBR @Z 1 aut.`
A14	`02`			`@1 Discovery Sciences, Computational Sciences, AstraZeneca R&D Mölndal @2 431 83 Mölndal @3 SWE @Z 2 aut. @Z 3 aut.`
A20				`@1 51-62`
A21				`@1 2012`
A23	`01`			`@0 ENG`
A43	`01`			`@1 INIST @2 2652 @5 354000508673580060`
A44				`@0 0000 @1 © 2012 INIST-CNRS. All rights reserved.`
A45				`@0 56 ref.`
A47	`01`	`1`		`@0 12-0102438`
A60				`@1 P`
A61				`@0 A`
A64	`01`	`1`		`@0 Journal of chemical information and modeling`
A66	`01`			`@0 USA`
C01	`01`		`ENG`	@0 The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
C02	`01`	`X`		`@0 001D02B07D`
C02	`02`	`X`		`@0 001D02C04`
C02	`03`	`X`		`@0 002A01B`
C02	`04`	`X`		`@0 001D02C03`
C03	`01`	`X`	`FRE`	`@0 Fouille donnée @5 06`
C03	`01`	`X`	`ENG`	`@0 Data mining @5 06`
C03	`01`	`X`	`SPA`	`@0 Busca dato @5 06`
C03	`02`	`X`	`FRE`	`@0 Texte @5 07`
C03	`02`	`X`	`ENG`	`@0 Text @5 07`
C03	`02`	`X`	`SPA`	`@0 Texto @5 07`
C03	`03`	`X`	`FRE`	`@0 Bioinformatique @5 08`
C03	`03`	`X`	`ENG`	`@0 Bioinformatics @5 08`
C03	`03`	`X`	`SPA`	`@0 Bioinformática @5 08`
C03	`04`	`X`	`FRE`	`@0 Ontologie @5 09`
C03	`04`	`X`	`ENG`	`@0 Ontology @5 09`
C03	`04`	`X`	`SPA`	`@0 Ontología @5 09`
C03	`05`	`X`	`FRE`	`@0 Reconnaissance caractère @5 10`
C03	`05`	`X`	`ENG`	`@0 Character recognition @5 10`
C03	`05`	`X`	`SPA`	`@0 Reconocimiento carácter @5 10`
C03	`06`	`X`	`FRE`	`@0 Reconnaissance optique caractère @5 11`
C03	`06`	`X`	`ENG`	`@0 Optical character recognition @5 11`
C03	`06`	`X`	`SPA`	`@0 Reconocimento óptico de caracteres @5 11`
C03	`07`	`X`	`FRE`	`@0 Rupture @5 12`
C03	`07`	`X`	`ENG`	`@0 Rupture @5 12`
C03	`07`	`X`	`SPA`	`@0 Ruptura @5 12`
C03	`08`	`X`	`FRE`	`@0 Brevet @5 18`
C03	`08`	`X`	`ENG`	`@0 Patents @5 18`
C03	`08`	`X`	`SPA`	`@0 Patente @5 18`
C03	`09`	`X`	`FRE`	`@0 Propriété industrielle @5 19`
C03	`09`	`X`	`ENG`	`@0 Patent rights @5 19`
C03	`09`	`X`	`SPA`	`@0 Propiedad industrial @5 19`
C03	`10`	`X`	`FRE`	`@0 Dictionnaire automatique @5 20`
C03	`10`	`X`	`ENG`	`@0 Automatic dictionary @5 20`
C03	`10`	`X`	`SPA`	`@0 Diccionario automático @5 20`
C03	`11`	`X`	`FRE`	`@0 Industrie pharmaceutique @5 21`
C03	`11`	`X`	`ENG`	`@0 Pharmaceutical industry @5 21`
C03	`11`	`X`	`SPA`	`@0 Industria farmacéutica @5 21`
C03	`12`	`X`	`FRE`	`@0 Nomenclature @5 22`
C03	`12`	`X`	`ENG`	`@0 Nomenclature @5 22`
C03	`12`	`X`	`SPA`	`@0 Nomenclatura @5 22`
C03	`13`	`X`	`FRE`	`@0 Gène @5 23`
C03	`13`	`X`	`ENG`	`@0 Gene @5 23`
C03	`13`	`X`	`SPA`	`@0 Gen @5 23`
C03	`14`	`X`	`FRE`	`@0 Homme @5 24`
C03	`14`	`X`	`ENG`	`@0 Human @5 24`
C03	`14`	`X`	`SPA`	`@0 Hombre @5 24`
C03	`15`	`X`	`FRE`	`@0 Typographie @5 25`
C03	`15`	`X`	`ENG`	`@0 Typography @5 25`
C03	`15`	`X`	`SPA`	`@0 Tipografía @5 25`
C03	`16`	`X`	`FRE`	`@0 Erreur humaine @5 26`
C03	`16`	`X`	`ENG`	`@0 Human error @5 26`
C03	`16`	`X`	`SPA`	`@0 Error humano @5 26`
C03	`17`	`X`	`FRE`	`@0 Correction automatique @5 27`
C03	`17`	`X`	`ENG`	`@0 Automatic correction @5 27`
C03	`17`	`X`	`SPA`	`@0 Corrección automática @5 27`
C03	`18`	`X`	`FRE`	`@0 Trait union @5 41`
C03	`18`	`X`	`ENG`	`@0 Hyphen @5 41`
C03	`18`	`X`	`SPA`	`@0 Guión @5 41`
C03	`19`	`X`	`FRE`	`@0 . @4 INC @5 82`
C07	`01`	`X`	`FRE`	`@0 Propriété intellectuelle`
C07	`01`	`X`	`ENG`	`@0 Intellectual property`
C07	`01`	`X`	`SPA`	`@0 Propiedad intelectual`
N21				`@1 079`
N44	`01`			`@1 OTO`
N82				`@1 OTO`
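
In the field lines above, each value is made up of subfields introduced by `@` and a one-character code (for example `@0 Data mining @5 06`). A minimal Python sketch of splitting such a value into its subfields is given below. It assumes that subfields are separated by whitespace and that the surrounding backticks have already been stripped; the function name `parse_subfields` is hypothetical and is not part of Dilib or of any Inist tool.

```python
import re
from typing import Dict, List


def parse_subfields(value: str) -> Dict[str, List[str]]:
    """Split '@0 Data mining @5 06' into {'0': ['Data mining'], '5': ['06']}.

    Values are stored in lists because a code can repeat within one field
    (the affiliation field above carries two @Z subfields).
    """
    subfields: Dict[str, List[str]] = {}
    # A subfield runs from '@<code> ' up to the next ' @<code> ' or end of string.
    for code, content in re.findall(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)", value):
        subfields.setdefault(code, []).append(content.strip())
    return subfields


if __name__ == "__main__":
    print(parse_subfields("@1 NextMove Software @2 Cambridge @3 GBR @Z 1 aut."))
```

Keeping every code as a list avoids silently dropping repeated subfields, at the cost of a slightly more verbose lookup for single-valued ones.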

Inist format (server)

NO :	PASCAL 12-0102438 INIST
ET :	Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction
AU :	SAYLE (Roger); HONGXING XIE (Paul); MURESAN (Sorel)
AF :	NextMove Software/Cambridge/Royaume-Uni (1 aut.); Discovery Sciences, Computational Sciences, AstraZeneca R&D Mölndal/431 83 Mölndal/Suède (2 aut., 3 aut.)
DT :	Publication en série; Niveau analytique
SO :	Journal of chemical information and modeling; ISSN 1549-9596; Etats-Unis; Da. 2012; Vol. 52; No. 1; Pp. 51-62; Bibl. 56 ref.
LA :	Anglais
EA :	The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
CC :	001D02B07D; 001D02C04; 002A01B; 001D02C03
FD :	Fouille donnée; Texte; Bioinformatique; Ontologie; Reconnaissance caractère; Reconnaissance optique caractère; Rupture; Brevet; Propriété industrielle; Dictionnaire automatique; Industrie pharmaceutique; Nomenclature; Gène; Homme; Typographie; Erreur humaine; Correction automatique; Trait union; .
FG :	Propriété intellectuelle
ED :	Data mining; Text; Bioinformatics; Ontology; Character recognition; Optical character recognition; Rupture; Patents; Patent rights; Automatic dictionary; Pharmaceutical industry; Nomenclature; Gene; Human; Typography; Human error; Automatic correction; Hyphen
EG :	Intellectual property
SD :	Busca dato; Texto; Bioinformática; Ontología; Reconocimiento carácter; Reconocimento óptico de caracteres; Ruptura; Patente; Propiedad industrial; Diccionario automático; Industria farmacéutica; Nomenclatura; Gen; Hombre; Tipografía; Error humano; Corrección automática; Guión
LO :	INIST-2652.354000508673580060
ID :	12-0102438
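
The server-format listing above is a sequence of `TAG : value` lines, with several fields holding semicolon-separated lists. As a rough illustration, it could be read into a dictionary as sketched below in Python. The function name `parse_inist_server` and the choice of which tags to treat as lists (`LIST_FIELDS`) are assumptions based only on the lines shown here, not on any Inist specification.

```python
import re
from typing import Dict, List, Union

# Assumption: these tags hold semicolon-separated lists in the listing above.
LIST_FIELDS = {"AU", "CC", "FD", "ED", "SD"}


def parse_inist_server(text: str) -> Dict[str, Union[str, List[str]]]:
    """Parse 'TAG :<tab>value' lines into a dict keyed by the two-letter tag."""
    record: Dict[str, Union[str, List[str]]] = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z]{2})\s*:\s*(.*)$", line)
        if not match:
            continue  # skip blank lines and anything that is not a field line
        tag, value = match.group(1), match.group(2).strip()
        if tag in LIST_FIELDS:
            record[tag] = [item.strip() for item in value.split(";") if item.strip()]
        else:
            record[tag] = value
    return record


if __name__ == "__main__":
    sample = (
        "NO :\tPASCAL 12-0102438 INIST\n"
        "AU :\tSAYLE (Roger); HONGXING XIE (Paul); MURESAN (Sorel)\n"
        "LA :\tAnglais\n"
    )
    print(parse_inist_server(sample))
```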

Links to Exploration step

Pascal:12-0102438

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</title>
<author><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
<affiliation><inist:fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
<affiliation><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
<affiliation><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">12-0102438</idno>
<date when="2012">2012</date>
<idno type="stanalyst">PASCAL 12-0102438 INIST</idno>
<idno type="RBID">Pascal:12-0102438</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000102</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</title>
<author><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
<affiliation><inist:fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
<affiliation><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
<affiliation><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Journal of chemical information and modeling</title>
<title level="j" type="abbreviated">J. chem. inf. model. </title>
<idno type="ISSN">1549-9596</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Journal of chemical information and modeling</title>
<title level="j" type="abbreviated">J. chem. inf. model. </title>
<idno type="ISSN">1549-9596</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic correction</term>
<term>Automatic dictionary</term>
<term>Bioinformatics</term>
<term>Character recognition</term>
<term>Data mining</term>
<term>Gene</term>
<term>Human</term>
<term>Human error</term>
<term>Hyphen</term>
<term>Nomenclature</term>
<term>Ontology</term>
<term>Optical character recognition</term>
<term>Patent rights</term>
<term>Patents</term>
<term>Pharmaceutical industry</term>
<term>Rupture</term>
<term>Text</term>
<term>Typography</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Fouille donnée</term>
<term>Texte</term>
<term>Bioinformatique</term>
<term>Ontologie</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Rupture</term>
<term>Brevet</term>
<term>Propriété industrielle</term>
<term>Dictionnaire automatique</term>
<term>Industrie pharmaceutique</term>
<term>Nomenclature</term>
<term>Gène</term>
<term>Homme</term>
<term>Typographie</term>
<term>Erreur humaine</term>
<term>Correction automatique</term>
<term>Trait union</term>
<term>.</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>1549-9596</s0>
</fA01>
<fA03 i2="1"><s0>J. chem. inf. model. </s0>
</fA03>
<fA05><s2>52</s2>
</fA05>
<fA06><s2>1</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG"><s1>Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</s1>
</fA08>
<fA11 i1="01" i2="1"><s1>SAYLE (Roger)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>HONGXING XIE (Paul)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>MURESAN (Sorel)</s1>
</fA11>
<fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA20><s1>51-62</s1>
</fA20>
<fA21><s1>2012</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA43 i1="01"><s1>INIST</s1>
<s2>2652</s2>
<s5>354000508673580060</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2012 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>56 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>12-0102438</s0>
</fA47>
<fA60><s1>P</s1>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA64 i1="01" i2="1"><s0>Journal of chemical information and modeling</s0>
</fA64>
<fA66 i1="01"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001D02B07D</s0>
</fC02>
<fC02 i1="02" i2="X"><s0>001D02C04</s0>
</fC02>
<fC02 i1="03" i2="X"><s0>002A01B</s0>
</fC02>
<fC02 i1="04" i2="X"><s0>001D02C03</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Fouille donnée</s0>
<s5>06</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Data mining</s0>
<s5>06</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Busca dato</s0>
<s5>06</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Texte</s0>
<s5>07</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Text</s0>
<s5>07</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Texto</s0>
<s5>07</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Bioinformatique</s0>
<s5>08</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Bioinformatics</s0>
<s5>08</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Bioinformática</s0>
<s5>08</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Ontologie</s0>
<s5>09</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Ontology</s0>
<s5>09</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Ontología</s0>
<s5>09</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>10</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>10</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>10</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>11</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>11</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0>
<s5>11</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Rupture</s0>
<s5>12</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Rupture</s0>
<s5>12</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Ruptura</s0>
<s5>12</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE"><s0>Brevet</s0>
<s5>18</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Patents</s0>
<s5>18</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA"><s0>Patente</s0>
<s5>18</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE"><s0>Propriété industrielle</s0>
<s5>19</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG"><s0>Patent rights</s0>
<s5>19</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA"><s0>Propiedad industrial</s0>
<s5>19</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE"><s0>Dictionnaire automatique</s0>
<s5>20</s5>
</fC03>
<fC03 i1="10" i2="X" l="ENG"><s0>Automatic dictionary</s0>
<s5>20</s5>
</fC03>
<fC03 i1="10" i2="X" l="SPA"><s0>Diccionario automático</s0>
<s5>20</s5>
</fC03>
<fC03 i1="11" i2="X" l="FRE"><s0>Industrie pharmaceutique</s0>
<s5>21</s5>
</fC03>
<fC03 i1="11" i2="X" l="ENG"><s0>Pharmaceutical industry</s0>
<s5>21</s5>
</fC03>
<fC03 i1="11" i2="X" l="SPA"><s0>Industria farmacéutica</s0>
<s5>21</s5>
</fC03>
<fC03 i1="12" i2="X" l="FRE"><s0>Nomenclature</s0>
<s5>22</s5>
</fC03>
<fC03 i1="12" i2="X" l="ENG"><s0>Nomenclature</s0>
<s5>22</s5>
</fC03>
<fC03 i1="12" i2="X" l="SPA"><s0>Nomenclatura</s0>
<s5>22</s5>
</fC03>
<fC03 i1="13" i2="X" l="FRE"><s0>Gène</s0>
<s5>23</s5>
</fC03>
<fC03 i1="13" i2="X" l="ENG"><s0>Gene</s0>
<s5>23</s5>
</fC03>
<fC03 i1="13" i2="X" l="SPA"><s0>Gen</s0>
<s5>23</s5>
</fC03>
<fC03 i1="14" i2="X" l="FRE"><s0>Homme</s0>
<s5>24</s5>
</fC03>
<fC03 i1="14" i2="X" l="ENG"><s0>Human</s0>
<s5>24</s5>
</fC03>
<fC03 i1="14" i2="X" l="SPA"><s0>Hombre</s0>
<s5>24</s5>
</fC03>
<fC03 i1="15" i2="X" l="FRE"><s0>Typographie</s0>
<s5>25</s5>
</fC03>
<fC03 i1="15" i2="X" l="ENG"><s0>Typography</s0>
<s5>25</s5>
</fC03>
<fC03 i1="15" i2="X" l="SPA"><s0>Tipografía</s0>
<s5>25</s5>
</fC03>
<fC03 i1="16" i2="X" l="FRE"><s0>Erreur humaine</s0>
<s5>26</s5>
</fC03>
<fC03 i1="16" i2="X" l="ENG"><s0>Human error</s0>
<s5>26</s5>
</fC03>
<fC03 i1="16" i2="X" l="SPA"><s0>Error humano</s0>
<s5>26</s5>
</fC03>
<fC03 i1="17" i2="X" l="FRE"><s0>Correction automatique</s0>
<s5>27</s5>
</fC03>
<fC03 i1="17" i2="X" l="ENG"><s0>Automatic correction</s0>
<s5>27</s5>
</fC03>
<fC03 i1="17" i2="X" l="SPA"><s0>Corrección automática</s0>
<s5>27</s5>
</fC03>
<fC03 i1="18" i2="X" l="FRE"><s0>Trait union</s0>
<s5>41</s5>
</fC03>
<fC03 i1="18" i2="X" l="ENG"><s0>Hyphen</s0>
<s5>41</s5>
</fC03>
<fC03 i1="18" i2="X" l="SPA"><s0>Guión</s0>
<s5>41</s5>
</fC03>
<fC03 i1="19" i2="X" l="FRE"><s0>.</s0>
<s4>INC</s4>
<s5>82</s5>
</fC03>
<fC07 i1="01" i2="X" l="FRE"><s0>Propriété intellectuelle</s0>
</fC07>
<fC07 i1="01" i2="X" l="ENG"><s0>Intellectual property</s0>
</fC07>
<fC07 i1="01" i2="X" l="SPA"><s0>Propiedad intelectual</s0>
</fC07>
<fN21><s1>079</s1>
</fN21>
<fN44 i1="01"><s1>OTO</s1>
</fN44>
<fN82><s1>OTO</s1>
</fN82>
</pA>
</standard>
<server><NO>PASCAL 12-0102438 INIST</NO>
<ET>Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</ET>
<AU>SAYLE (Roger); HONGXING XIE (Paul); MURESAN (Sorel)</AU>
<AF>NextMove Software/Cambridge/Royaume-Uni (1 aut.); Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal/431 83 Mölndal/Suède (2 aut., 3 aut.)</AF>
<DT>Publication en série; Niveau analytique</DT>
<SO>Journal of chemical information and modeling; ISSN 1549-9596; Etats-Unis; Da. 2012; Vol. 52; No. 1; Pp. 51-62; Bibl. 56 ref.</SO>
<LA>Anglais</LA>
<EA>The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.</EA>
<CC>001D02B07D; 001D02C04; 002A01B; 001D02C03</CC>
<FD>Fouille donnée; Texte; Bioinformatique; Ontologie; Reconnaissance caractère; Reconnaissance optique caractère; Rupture; Brevet; Propriété industrielle; Dictionnaire automatique; Industrie pharmaceutique; Nomenclature; Gène; Homme; Typographie; Erreur humaine; Correction automatique; Trait union; .</FD>
<FG>Propriété intellectuelle</FG>
<ED>Data mining; Text; Bioinformatics; Ontology; Character recognition; Optical character recognition; Rupture; Patents; Patent rights; Automatic dictionary; Pharmaceutical industry; Nomenclature; Gene; Human; Typography; Human error; Automatic correction; Hyphen</ED>
<EG>Intellectual property</EG>
<SD>Busca dato; Texto; Bioinformática; Ontología; Reconocimiento carácter; Reconocimento óptico de caracteres; Ruptura; Patente; Propiedad industrial; Diccionario automático; Industria farmacéutica; Nomenclatura; Gen; Hombre; Tipografía; Error humano; Corrección automática; Guión</SD>
<LO>INIST-2652.354000508673580060</LO>
<ID>12-0102438</ID>
</server>
</inist>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000102 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000102 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:12-0102438
   |texte=   Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
