Effect of OCR error correction on Arabic retrieval

Internal identifier: 000252 (PascalFrancis/Corpus); previous: 000251; next: 000253

Authors: Walid Magdy; Kareem Darwish

Source :

Information retrieval : (Boston) [ 1386-4564 ] ; 2008.

RBID : Francis:09-0100461

French descriptors

Pascal (Inist)
- Recherche information, Arabe, Modèle de langage, Correction erreur, Reconnaissance optique caractère.

English descriptors

KwdEn :
- Arabic, Error correction, Information retrieval, Language model, Optical character recognition.

Abstract

Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
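
The abstract refers to character 3-grams as index terms for OCR-degraded Arabic text. As a rough illustration only (this record does not describe the paper's actual tokenizer), the sketch below splits a word into overlapping character n-grams and shows that a single misrecognized character changes only some of them. The function names `char_ngrams` and `shared_ngrams` and the sample words are hypothetical.

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """Return the overlapping character n-grams of a word.

    Words of length n or less are kept whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(clean: str, degraded: str, n: int = 3) -> set:
    """Return the n-grams that survive in a degraded copy of a word."""
    return set(char_ngrams(clean, n)) & set(char_ngrams(degraded, n))


if __name__ == "__main__":
    clean = "المكتبة"      # hypothetical query word
    degraded = "الملتبة"   # same word with one character substituted
    print(char_ngrams(clean))
    print(shared_ngrams(clean, degraded))
```

With whole-word index terms the degraded form would not match the query word at all, whereas some of its n-grams still do.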

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

A01	`01`	`1`		`@0 1386-4564`
A03		`1`		`@0 Inf. retr. : (Boston)`
A05				`@2 11`
A06				`@2 5`
A08	`01`	`1`	`ENG`	`@1 Effect of OCR error correction on Arabic retrieval`
A11	`01`	`1`		`@1 MAGDY (Walid)`
A11	`02`	`1`		`@1 DARWISH (Kareem)`
A14	`01`			`@1 Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd @2 Abou Rawash @3 EGY @Z 1 aut. @Z 2 aut.`
A20				`@1 405-425`
A21				`@1 2008`
A23	`01`			`@0 ENG`
A43	`01`			`@1 INIST @2 27066 @5 354000200338910020`
A44				`@0 0000 @1 © 2009 INIST-CNRS. All rights reserved.`
A45				`@0 2 p.1/2`
A47	`01`	`1`		`@0 09-0100461`
A60				`@1 P`
A61				`@0 A`
A64	`01`	`1`		`@0 Information retrieval : (Boston)`
A66	`01`			`@0 NLD`
C01	`01`		`ENG`	@0 Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
C02	`01`	`X`		`@0 790F03C @1 VI`
C03	`01`	`X`	`FRE`	`@0 Recherche information @5 04`
C03	`01`	`X`	`ENG`	`@0 Information retrieval @5 04`
C03	`01`	`X`	`SPA`	`@0 Búsqueda información @5 04`
C03	`02`	`X`	`FRE`	`@0 Arabe @5 05`
C03	`02`	`X`	`ENG`	`@0 Arabic @5 05`
C03	`02`	`X`	`SPA`	`@0 Árabe @5 05`
C03	`03`	`L`	`FRE`	`@0 Modèle de langage @2 563 @5 06`
C03	`03`	`L`	`ENG`	`@0 Language model @2 563 @5 06`
C03	`04`	`X`	`FRE`	`@0 Correction erreur @5 07`
C03	`04`	`X`	`ENG`	`@0 Error correction @5 07`
C03	`04`	`X`	`SPA`	`@0 Corrección error @5 07`
C03	`05`	`X`	`FRE`	`@0 Reconnaissance optique caractère @5 08`
C03	`05`	`X`	`ENG`	`@0 Optical character recognition @5 08`
C03	`05`	`X`	`SPA`	`@0 Reconocimiento óptico de caracteres @5 08`
N21				`@1 068`

Inist format (server)

NO :	FRANCIS 09-0100461 INIST
ET :	Effect of OCR error correction on Arabic retrieval
AU :	MAGDY (Walid); DARWISH (Kareem)
AF :	Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd/Abou Rawash/Egypte (1 aut., 2 aut.)
DT :	Publication en série; Niveau analytique
SO :	Information retrieval : (Boston); ISSN 1386-4564; Pays-Bas; Da. 2008; Vol. 11; No. 5; Pp. 405-425; Bibl. 2 p.1/2
LA :	Anglais
EA :	Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
CC :	790F03C
FD :	Recherche information; Arabe; Modèle de langage; Correction erreur; Reconnaissance optique caractère
ED :	Information retrieval; Arabic; Language model; Error correction; Optical character recognition
SD :	Búsqueda información; Árabe; Corrección error; Reconocimiento óptico de caracteres
LO :	INIST-27066.354000200338910020
ID :	09-0100461
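
The server format above is a list of two-letter tags followed by a colon and a value, with multi-valued fields separated by semicolons. The following is a minimal sketch of reading such lines into a dictionary, assuming that layout holds for every line. The helper names `parse_inist_server` and `split_values` are hypothetical and not part of Dilib or any Inist tool.

```python
def parse_inist_server(text: str) -> dict:
    """Map each two-letter tag (NO, ET, AU, ...) to its raw value."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only: values such as
        # "Information retrieval : (Boston)" contain colons themselves.
        tag, sep, value = line.partition(":")
        tag = tag.strip()
        if sep and len(tag) == 2 and tag.isupper():
            fields[tag] = value.strip()
    return fields


def split_values(value: str) -> list:
    """Split a semicolon-separated field (AU, FD, ED, SD) into items."""
    return [item.strip() for item in value.split(";") if item.strip()]


if __name__ == "__main__":
    sample = (
        "NO :\tFRANCIS 09-0100461 INIST\n"
        "AU :\tMAGDY (Walid); DARWISH (Kareem)\n"
        "ED :\tInformation retrieval; Arabic; Language model\n"
    )
    record = parse_inist_server(sample)
    print(record["NO"])
    print(split_values(record["AU"]))
```

Splitting on semicolons is only safe for fields whose values do not contain semicolons themselves; the abstract field (EA) is best kept as a single string.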

Links to Exploration step

Francis:09-0100461

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Effect of OCR error correction on Arabic retrieval</title>
<author><name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation><inist:fA14 i1="01"><s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation><inist:fA14 i1="01"><s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">09-0100461</idno>
<date when="2008">2008</date>
<idno type="stanalyst">FRANCIS 09-0100461 INIST</idno>
<idno type="RBID">Francis:09-0100461</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000252</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Effect of OCR error correction on Arabic retrieval</title>
<author><name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation><inist:fA14 i1="01"><s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation><inist:fA14 i1="01"><s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Information retrieval : (Boston)</title>
<title level="j" type="abbreviated">Inf. retr. : (Boston)</title>
<idno type="ISSN">1386-4564</idno>
<imprint><date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Information retrieval : (Boston)</title>
<title level="j" type="abbreviated">Inf. retr. : (Boston)</title>
<idno type="ISSN">1386-4564</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Arabic</term>
<term>Error correction</term>
<term>Information retrieval</term>
<term>Language model</term>
<term>Optical character recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Recherche information</term>
<term>Arabe</term>
<term>Modèle de langage</term>
<term>Correction erreur</term>
<term>Reconnaissance optique caractère</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>1386-4564</s0>
</fA01>
<fA03 i2="1"><s0>Inf. retr. : (Boston)</s0>
</fA03>
<fA05><s2>11</s2>
</fA05>
<fA06><s2>5</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG"><s1>Effect of OCR error correction on Arabic retrieval</s1>
</fA08>
<fA11 i1="01" i2="1"><s1>MAGDY (Walid)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>DARWISH (Kareem)</s1>
</fA11>
<fA14 i1="01"><s1>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd</s1>
<s2>Abou Rawash</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA20><s1>405-425</s1>
</fA20>
<fA21><s1>2008</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA43 i1="01"><s1>INIST</s1>
<s2>27066</s2>
<s5>354000200338910020</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2009 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>2 p.1/2</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>09-0100461</s0>
</fA47>
<fA60><s1>P</s1>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA64 i1="01" i2="1"><s0>Information retrieval : (Boston)</s0>
</fA64>
<fA66 i1="01"><s0>NLD</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>790F03C</s0>
<s1>VI</s1>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Recherche information</s0>
<s5>04</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Information retrieval</s0>
<s5>04</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Búsqueda información</s0>
<s5>04</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Arabe</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Arabic</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Árabe</s0>
<s5>05</s5>
</fC03>
<fC03 i1="03" i2="L" l="FRE"><s0>Modèle de langage</s0>
<s2>563</s2>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="L" l="ENG"><s0>Language model</s0>
<s2>563</s2>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Correction erreur</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Error correction</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Corrección error</s0>
<s5>07</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Reconocimiento óptico de caracteres</s0>
<s5>08</s5>
</fC03>
<fN21><s1>068</s1>
</fN21>
</pA>
</standard>
<server><NO>FRANCIS 09-0100461 INIST</NO>
<ET>Effect of OCR error correction on Arabic retrieval</ET>
<AU>MAGDY (Walid); DARWISH (Kareem)</AU>
<AF>Cairo Microsoft Innovation Center, Smart Village-Bldg B115, Km 28, Cairo-Alexandria Desert Rd/Abou Rawash/Egypte (1 aut., 2 aut.)</AF>
<DT>Publication en série; Niveau analytique</DT>
<SO>Information retrieval : (Boston); ISSN 1386-4564; Pays-Bas; Da. 2008; Vol. 11; No. 5; Pp. 405-425; Bibl. 2 p.1/2</SO>
<LA>Anglais</LA>
<EA>Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling with different correction abilities were tested on real OCR and synthetic OCR degradation. Results show that the reduction of word error rates needs to pass a certain limit to get a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-gram for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.</EA>
<CC>790F03C</CC>
<FD>Recherche information; Arabe; Modèle de langage; Correction erreur; Reconnaissance optique caractère</FD>
<ED>Information retrieval; Arabic; Language model; Error correction; Optical character recognition</ED>
<SD>Búsqueda información; Árabe; Corrección error; Reconocimiento óptico de caracteres</SD>
<LO>INIST-27066.354000200338910020</LO>
<ID>09-0100461</ID>
</server>
</inist>
</record>
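
In the `<inist><standard>` part of the record, each `fC03` element carries one descriptor in its `s0` child and a language code in its `l` attribute. The sketch below groups descriptors by language using Python's standard `xml.etree.ElementTree`. It assumes the `<inist>` element has first been cut out of the record as its own string: the TEI part uses an `inist:` prefix with no namespace declaration visible here, which a strict XML parser may reject. The function name `descriptors_by_language` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def descriptors_by_language(inist_xml: str) -> dict:
    """Group the <s0> text of every <fC03> field by its 'l' attribute."""
    root = ET.fromstring(inist_xml)
    grouped = defaultdict(list)
    for field in root.iter("fC03"):
        lang = field.get("l")
        term = field.findtext("s0")
        if lang and term:
            grouped[lang].append(term)
    return dict(grouped)


if __name__ == "__main__":
    fragment = (
        '<inist><standard><pA>'
        '<fC03 i1="02" i2="X" l="FRE"><s0>Arabe</s0><s5>05</s5></fC03>'
        '<fC03 i1="02" i2="X" l="ENG"><s0>Arabic</s0><s5>05</s5></fC03>'
        '</pA></standard></inist>'
    )
    print(descriptors_by_language(fragment))
```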

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000252 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000252 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Francis:09-0100461
   |texte=   Effect of OCR error correction on Arabic retrieval
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
