Reduction of expanded search terms for fuzzy English-text retrieval

Internal identifier : 000841 ( PascalFrancis/Corpus ); previous : 000840; next : 000842

Authors : M. Ohta ; A. Takasu ; J. Adachi

Source :

Lecture notes in computer science [ 0302-9743 ] ; 1998.

RBID : Pascal:99-0073116

French descriptors

Pascal (Inist)
- Reconnaissance caractère, Reconnaissance automatique, Erreur, Recherche information, Système flou, Anglais, Correction erreur, Bibliothèque numérique.

English descriptors

KwdEn :
- Automatic recognition, Character recognition, Digital library, English, Error, Error correction, Fuzzy system, Information retrieval.

Abstract

Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.
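The expansion step described in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the paper's two heuristics, so this is not the authors' method: it only shows the general idea of generating variants from a confusion table while capping the number of substituted characters per variant. The table contents, the function name `expand_term`, the `max_errors` parameter, and the score (the product of the substitution probabilities) are all assumptions made for illustration.

```python
from itertools import combinations, product

# Toy confusion table (hypothetical values, not taken from the paper):
# for each true character, the characters an OCR engine might output
# instead, each with an assumed misrecognition probability.
CONFUSIONS = {
    "l": {"1": 0.05, "i": 0.03},
    "o": {"0": 0.04},
    "e": {"c": 0.02},
}


def expand_term(term, confusions, max_errors):
    """Return {variant: score} for every variant of `term` that contains
    at most `max_errors` substituted characters.

    The score is the product of the probabilities of the substitutions
    applied; the unmodified term gets a score of 1.0.
    """
    variants = {}
    # Positions whose character has at least one known confusion.
    positions = [i for i, ch in enumerate(term) if ch in confusions]
    for k in range(min(max_errors, len(positions)) + 1):
        # Choose which k positions are misrecognized ...
        for chosen in combinations(positions, k):
            options = [list(confusions[term[i]].items()) for i in chosen]
            # ... and which wrong character appears at each of them.
            for subs in product(*options):
                chars = list(term)
                score = 1.0
                for i, (wrong, p) in zip(chosen, subs):
                    chars[i] = wrong
                    score *= p
                variants["".join(chars)] = score
    return variants


capped = expand_term("hello", CONFUSIONS, max_errors=1)
uncapped = expand_term("hello", CONFUSIONS, max_errors=len("hello"))
print(len(capped), len(uncapped))
```

With this toy table, the term "hello" yields 7 variants when at most one error is allowed and 36 when the cap is lifted. The gap widens quickly for longer terms and denser confusion tables, which is the growth the abstract describes.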

Record in standard format (ISO 2709)

For details, see the documentation on the Inist Standard format.

A01	`01`	`1`		`@0 0302-9743`
A05				`@2 1513`
A08	`01`	`1`	`ENG`	`@1 Reduction of expanded search terms for fuzzy English-text retrieval`
A09	`01`	`1`	`ENG`	`@1 Research and advanced technology for digital libraries : Heraklion, 21-23 September 1998`
A11	`01`	`1`		`@1 OHTA (M.)`
A11	`02`	`1`		`@1 TAKASU (A.)`
A11	`03`	`1`		`@1 ADACHI (J.)`
A12	`01`	`1`		`@1 NIKOLAOU (Christos) @9 ed.`
A12	`02`	`1`		`@1 STEPHANIDIS (Constantine) @9 ed.`
A14	`01`			`@1 Graduate School of Engineering, University of Tokyo @2 Tokyo 113-8654 @3 JPN @Z 1 aut.`
A14	`02`			`@1 Research & Development Department, National Center for Science Information Systems (NACSIS) @2 Tokyo 112-8640 @3 JPN @Z 2 aut. @Z 3 aut.`
A20				`@1 619-633`
A21				`@1 1998`
A23	`01`			`@0 ENG`
A26	`01`			`@0 3-540-65101-2`
A43	`01`			`@1 INIST @2 16343 @5 354000070163720370`
A44				`@0 0000 @1 © 1999 INIST-CNRS. All rights reserved.`
A45				`@0 6 ref.`
A47	`01`	`1`		`@0 99-0073116`
A60				`@1 P @2 C`
A61				`@0 A`
A64		`1`		`@0 Lecture notes in computer science`
A66	`01`			`@0 DEU`
A66	`02`			`@0 USA`
C01	`01`		`ENG`	`@0 Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.`
C02	`01`	`X`		`@0 001A01E03C`
C02	`02`	`X`		`@0 205`
C03	`01`	`X`	`FRE`	`@0 Reconnaissance caractère @5 03`
C03	`01`	`X`	`ENG`	`@0 Character recognition @5 03`
C03	`01`	`X`	`SPA`	`@0 Reconocimiento carácter @5 03`
C03	`02`	`X`	`FRE`	`@0 Reconnaissance automatique @5 04`
C03	`02`	`X`	`ENG`	`@0 Automatic recognition @5 04`
C03	`02`	`X`	`SPA`	`@0 Reconocimiento automático @5 04`
C03	`03`	`X`	`FRE`	`@0 Erreur @5 05`
C03	`03`	`X`	`ENG`	`@0 Error @5 05`
C03	`03`	`X`	`GER`	`@0 Abweichung @5 05`
C03	`03`	`X`	`SPA`	`@0 Error @5 05`
C03	`04`	`X`	`FRE`	`@0 Recherche information @5 06`
C03	`04`	`X`	`ENG`	`@0 Information retrieval @5 06`
C03	`04`	`X`	`SPA`	`@0 Recuperación información @5 06`
C03	`05`	`X`	`FRE`	`@0 Système flou @5 09`
C03	`05`	`X`	`ENG`	`@0 Fuzzy system @5 09`
C03	`05`	`X`	`SPA`	`@0 Sistema difuso @5 09`
C03	`06`	`X`	`FRE`	`@0 Anglais @5 11`
C03	`06`	`X`	`ENG`	`@0 English @5 11`
C03	`06`	`X`	`SPA`	`@0 Inglés @5 11`
C03	`07`	`X`	`FRE`	`@0 Correction erreur @5 12`
C03	`07`	`X`	`ENG`	`@0 Error correction @5 12`
C03	`07`	`X`	`GER`	`@0 Fehlerkorrektur @5 12`
C03	`07`	`X`	`SPA`	`@0 Corrección error @5 12`
C03	`08`	`X`	`FRE`	`@0 Bibliothèque numérique @4 CD @5 96`
C03	`08`	`X`	`ENG`	`@0 Digital library @4 CD @5 96`
N21				`@1 039`

A30	`01`	`1`	`ENG`	`@1 ECDL '98 : European conference on digital libraries @2 2 @3 Heraklion GRC @4 1998-09-21`

Inist format (server)

NO :	PASCAL 99-0073116 INIST
ET :	Reduction of expanded search terms for fuzzy English-text retrieval
AU :	OHTA (M.); TAKASU (A.); ADACHI (J.); NIKOLAOU (Christos); STEPHANIDIS (Constantine)
AF :	Graduate School of Engineering, University of Tokyo/Tokyo 113-8654/Japan (1 aut.); Research & Development Department, National Center for Science Information Systems (NACSIS)/Tokyo 112-8640/Japan (2 aut., 3 aut.)
DT :	Serial publication; Conference; Analytic level
SO :	Lecture notes in computer science; ISSN 0302-9743; Germany; Da. 1998; Vol. 1513; Pp. 619-633; Bibl. 6 ref.
LA :	English
EA :	Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.
CC :	001A01E03C; 205
FD :	Reconnaissance caractère; Reconnaissance automatique; Erreur; Recherche information; Système flou; Anglais; Correction erreur; Bibliothèque numérique
ED :	Character recognition; Automatic recognition; Error; Information retrieval; Fuzzy system; English; Error correction; Digital library
GD :	Abweichung; Fehlerkorrektur
SD :	Reconocimiento carácter; Reconocimiento automático; Error; Recuperación información; Sistema difuso; Inglés; Corrección error
LO :	INIST-16343.354000070163720370
ID :	99-0073116

Links to Exploration step

Pascal:99-0073116

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Reduction of expanded search terms for fuzzy English-text retrieval</title>
<author><name sortKey="Ohta, M" sort="Ohta, M" uniqKey="Ohta M" first="M." last="Ohta">M. Ohta</name>
<affiliation><inist:fA14 i1="01"><s1>Graduate School of Engineering, University of Tokyo</s1>
<s2>Tokyo 113-8654</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Takasu, A" sort="Takasu, A" uniqKey="Takasu A" first="A." last="Takasu">A. Takasu</name>
<affiliation><inist:fA14 i1="02"><s1>Research &amp; Development Department, National Center for Science Information Systems (NACSIS)</s1>
<s2>Tokyo 112-8640</s2>
<s3>JPN</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Adachi, J" sort="Adachi, J" uniqKey="Adachi J" first="J." last="Adachi">J. Adachi</name>
<affiliation><inist:fA14 i1="02"><s1>Research &amp; Development Department, National Center for Science Information Systems (NACSIS)</s1>
<s2>Tokyo 112-8640</s2>
<s3>JPN</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">99-0073116</idno>
<date when="1998">1998</date>
<idno type="stanalyst">PASCAL 99-0073116 INIST</idno>
<idno type="RBID">Pascal:99-0073116</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000841</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Reduction of expanded search terms for fuzzy English-text retrieval</title>
<author><name sortKey="Ohta, M" sort="Ohta, M" uniqKey="Ohta M" first="M." last="Ohta">M. Ohta</name>
<affiliation><inist:fA14 i1="01"><s1>Graduate School of Engineering, University of Tokyo</s1>
<s2>Tokyo 113-8654</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Takasu, A" sort="Takasu, A" uniqKey="Takasu A" first="A." last="Takasu">A. Takasu</name>
<affiliation><inist:fA14 i1="02"><s1>Research &amp; Development Department, National Center for Science Information Systems (NACSIS)</s1>
<s2>Tokyo 112-8640</s2>
<s3>JPN</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Adachi, J" sort="Adachi, J" uniqKey="Adachi J" first="J." last="Adachi">J. Adachi</name>
<affiliation><inist:fA14 i1="02"><s1>Research &amp; Development Department, National Center for Science Information Systems (NACSIS)</s1>
<s2>Tokyo 112-8640</s2>
<s3>JPN</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
<imprint><date when="1998">1998</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic recognition</term>
<term>Character recognition</term>
<term>Digital library</term>
<term>English</term>
<term>Error</term>
<term>Error correction</term>
<term>Fuzzy system</term>
<term>Information retrieval</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Reconnaissance automatique</term>
<term>Erreur</term>
<term>Recherche information</term>
<term>Système flou</term>
<term>Anglais</term>
<term>Correction erreur</term>
<term>Bibliothèque numérique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>0302-9743</s0>
</fA01>
<fA05><s2>1513</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG"><s1>Reduction of expanded search terms for fuzzy English-text retrieval</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>Research and advanced technology for digital libraries : Heraklion, 21-23 September 1998</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>OHTA (M.)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>TAKASU (A.)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>ADACHI (J.)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>NIKOLAOU (Christos)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>STEPHANIDIS (Constantine)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>Graduate School of Engineering, University of Tokyo</s1>
<s2>Tokyo 113-8654</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02"><s1>Research &amp; Development Department, National Center for Science Information Systems (NACSIS)</s1>
<s2>Tokyo 112-8640</s2>
<s3>JPN</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA20><s1>619-633</s1>
</fA20>
<fA21><s1>1998</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA26 i1="01"><s0>3-540-65101-2</s0>
</fA26>
<fA43 i1="01"><s1>INIST</s1>
<s2>16343</s2>
<s5>354000070163720370</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 1999 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>6 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>99-0073116</s0>
</fA47>
<fA60><s1>P</s1>
<s2>C</s2>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA64 i2="1"><s0>Lecture notes in computer science</s0>
</fA64>
<fA66 i1="01"><s0>DEU</s0>
</fA66>
<fA66 i1="02"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001A01E03C</s0>
</fC02>
<fC02 i1="02" i2="X"><s0>205</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>03</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>03</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>03</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Reconnaissance automatique</s0>
<s5>04</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Automatic recognition</s0>
<s5>04</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Reconocimiento automático</s0>
<s5>04</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Erreur</s0>
<s5>05</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Error</s0>
<s5>05</s5>
</fC03>
<fC03 i1="03" i2="X" l="GER"><s0>Abweichung</s0>
<s5>05</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Error</s0>
<s5>05</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Recherche information</s0>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Information retrieval</s0>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Recuperación información</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Système flou</s0>
<s5>09</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Fuzzy system</s0>
<s5>09</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Sistema difuso</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Anglais</s0>
<s5>11</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>English</s0>
<s5>11</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Inglés</s0>
<s5>11</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Correction erreur</s0>
<s5>12</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Error correction</s0>
<s5>12</s5>
</fC03>
<fC03 i1="07" i2="X" l="GER"><s0>Fehlerkorrektur</s0>
<s5>12</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Corrección error</s0>
<s5>12</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE"><s0>Bibliothèque numérique</s0>
<s4>CD</s4>
<s5>96</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Digital library</s0>
<s4>CD</s4>
<s5>96</s5>
</fC03>
<fN21><s1>039</s1>
</fN21>
</pA>
<pR><fA30 i1="01" i2="1" l="ENG"><s1>ECDL '98 : European conference on digital libraries</s1>
<s2>2</s2>
<s3>Heraklion GRC</s3>
<s4>1998-09-21</s4>
</fA30>
</pR>
</standard>
<server><NO>PASCAL 99-0073116 INIST</NO>
<ET>Reduction of expanded search terms for fuzzy English-text retrieval</ET>
<AU>OHTA (M.); TAKASU (A.); ADACHI (J.); NIKOLAOU (Christos); STEPHANIDIS (Constantine)</AU>
<AF>Graduate School of Engineering, University of Tokyo/Tokyo 113-8654/Japan (1 aut.); Research &amp; Development Department, National Center for Science Information Systems (NACSIS)/Tokyo 112-8640/Japan (2 aut., 3 aut.)</AF>
<DT>Serial publication; Conference; Analytic level</DT>
<SO>Lecture notes in computer science; ISSN 0302-9743; Germany; Da. 1998; Vol. 1513; Pp. 619-633; Bibl. 6 ref.</SO>
<LA>English</LA>
<EA>Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.</EA>
<CC>001A01E03C; 205</CC>
<FD>Reconnaissance caractère; Reconnaissance automatique; Erreur; Recherche information; Système flou; Anglais; Correction erreur; Bibliothèque numérique</FD>
<ED>Character recognition; Automatic recognition; Error; Information retrieval; Fuzzy system; English; Error correction; Digital library</ED>
<GD>Abweichung; Fehlerkorrektur</GD>
<SD>Reconocimiento carácter; Reconocimiento automático; Error; Recuperación información; Sistema difuso; Inglés; Corrección error</SD>
<LO>INIST-16343.354000070163720370</LO>
<ID>99-0073116</ID>
</server>
</inist>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000841 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000841 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:99-0073116
   |texte=   Reduction of expanded search terms for fuzzy English-text retrieval
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
