Word-based correction for retrieval of arabic OCR degraded documents
Internal identifier: 000461 (PascalFrancis/Curation); previous: 000460; next: 000462
Authors: Walid Magdy [Egypt]; Kareem Darwish [Egypt]
Source: Lecture notes in computer science [0302-9743]; 2006.
RBID: Pascal:07-0453873
French descriptors
English descriptors
Abstract
Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
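The abstract contrasts word-level index terms with short character n-grams. As a rough illustration of why n-grams may tolerate OCR noise, the sketch below splits words into overlapping character n-grams and measures how many survive a single-character recognition error. The function names (`char_ngrams`, `index_terms`, `surviving_fraction`) and the English example word are assumptions made for illustration; this is not the authors' indexing pipeline.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole (an illustrative choice).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(word) <= n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, n: int) -> list[str]:
    """Turn whitespace-separated text into character n-gram index terms."""
    terms: list[str] = []
    for word in text.split():
        terms.extend(char_ngrams(word, n))
    return terms


def surviving_fraction(clean: str, degraded: str, n: int) -> float:
    """Fraction of the clean word's n-grams that still occur in the degraded word."""
    clean_terms = set(char_ngrams(clean, n))
    degraded_terms = set(char_ngrams(degraded, n))
    if not clean_terms:
        return 0.0
    return len(clean_terms & degraded_terms) / len(clean_terms)


if __name__ == "__main__":
    # Hypothetical OCR error: the final "l" is misread as the digit "1".
    clean, degraded = "retrieval", "retrieva1"
    print(char_ngrams(clean, 3))
    print(surviving_fraction(clean, degraded, 3))
```

With word-level indexing the degraded token no longer matches the clean one at all, whereas most of its trigrams still match; this is one plausible reading of why short n-grams can compete with explicit error correction.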
pA

| Field | i1 | i2 | l | Content |
|---|---|---|---|---|
| A01 | 01 | 1 | | @0 0302-9743 |
| A05 | | | | @2 4209 |
| A08 | 01 | 1 | ENG | @1 Word-based correction for retrieval of arabic OCR degraded documents |
| A09 | 01 | 1 | ENG | @1 String processing and information retrieval : 13th International conference, SPIRE 2006, Glasgow, UK, October 11-13, 2006 : proceedings |
| A11 | 01 | 1 | | @1 MAGDY (Walid) |
| A11 | 02 | 1 | | @1 DARWISH (Kareem) |
| A12 | 01 | 1 | | @1 CRESTANI (Fabio) @9 ed. |
| A12 | 02 | 1 | | @1 FERRAGINA (Paolo) @9 ed. |
| A12 | 03 | 1 | | @1 SANDERSON (Mark) @9 ed. |
| A14 | 01 | | | @1 IBM Technology Development Center P.O. Box 166 El-Ahram @2 Giza @3 EGY @Z 1 aut. @Z 2 aut. |
| A20 | | | | @1 205-216 |
| A21 | | | | @1 2006 |
| A23 | 01 | | | @0 ENG |
| A26 | 01 | | | @0 3-540-45774-7 |
| A43 | 01 | | | @1 INIST @2 16343 @5 354000153609370170 |
| A44 | | | | @0 0000 @1 © 2007 INIST-CNRS. All rights reserved. |
| A45 | | | | @0 28 ref. |
| A47 | 01 | 1 | | @0 07-0453873 |
| A60 | | | | @1 P @2 C |
| A61 | | | | @0 A |
| A64 | 01 | 1 | | @0 Lecture notes in computer science |
| A66 | 01 | | | @0 DEU |
| A66 | 02 | | | @0 USA |
| C01 | 01 | | ENG | @0 Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages. |
| C02 | 01 | X | | @0 001D02B07B |
| C03 | 01 | X | FRE | @0 Recherche information @5 01 |
| C03 | 01 | X | ENG | @0 Information retrieval @5 01 |
| C03 | 01 | X | SPA | @0 Búsqueda información @5 01 |
| C03 | 02 | X | FRE | @0 Chaîne caractère @5 02 |
| C03 | 02 | X | ENG | @0 Character string @5 02 |
| C03 | 02 | X | SPA | @0 Cadena carácter @5 02 |
| C03 | 03 | X | FRE | @0 Reconnaissance caractère @5 06 |
| C03 | 03 | X | ENG | @0 Character recognition @5 06 |
| C03 | 03 | X | SPA | @0 Reconocimiento carácter @5 06 |
| C03 | 04 | X | FRE | @0 Reconnaissance optique caractère @5 07 |
| C03 | 04 | X | ENG | @0 Optical character recognition @5 07 |
| C03 | 04 | X | SPA | @0 Reconocimento óptico de caracteres @5 07 |
| C03 | 05 | X | FRE | @0 Informatique diffuse @5 08 |
| C03 | 05 | X | ENG | @0 Pervasive computing @5 08 |
| C03 | 05 | X | SPA | @0 Informática difusa @5 08 |
| C03 | 06 | X | FRE | @0 Indexation @5 09 |
| C03 | 06 | X | ENG | @0 Indexing @5 09 |
| C03 | 06 | X | SPA | @0 Indización @5 09 |
| C03 | 07 | X | FRE | @0 Arabe @5 18 |
| C03 | 07 | X | ENG | @0 Arabic @5 18 |
| C03 | 07 | X | SPA | @0 Árabe @5 18 |
| C03 | 08 | X | FRE | @0 Canal avec bruit @5 19 |
| C03 | 08 | X | ENG | @0 Noisy channel @5 19 |
| C03 | 08 | X | SPA | @0 Canal con ruido @5 19 |
| C03 | 09 | X | FRE | @0 Correction erreur @5 23 |
| C03 | 09 | X | ENG | @0 Error correction @5 23 |
| C03 | 09 | X | SPA | @0 Corrección error @5 23 |
| C03 | 10 | X | FRE | @0 . @4 INC @5 82 |
| N21 | | | | @1 295 |
| N44 | 01 | | | @1 OTO |
| N82 | | | | @1 OTO |

pR

| Field | i1 | i2 | l | Content |
|---|---|---|---|---|
| A30 | 01 | 1 | ENG | @1 International Conference on String Processing and Information Retrieval @2 13 @3 Glasgow GBR @4 2006 |
Links to previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: go to this record in the Curation step: 000325
Links to Exploration step
Pascal:07-0453873
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Word-based correction for retrieval of arabic OCR degraded documents</title>
<author><name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
<author><name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">07-0453873</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 07-0453873 INIST</idno>
<idno type="RBID">Pascal:07-0453873</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000325</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000461</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Word-based correction for retrieval of arabic OCR degraded documents</title>
<author><name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
<author><name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
<imprint><date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Arabic</term>
<term>Character recognition</term>
<term>Character string</term>
<term>Error correction</term>
<term>Indexing</term>
<term>Information retrieval</term>
<term>Noisy channel</term>
<term>Optical character recognition</term>
<term>Pervasive computing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Recherche information</term>
<term>Chaîne caractère</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Informatique diffuse</term>
<term>Indexation</term>
<term>Arabe</term>
<term>Canal avec bruit</term>
<term>Correction erreur</term>
<term>.</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>0302-9743</s0>
</fA01>
<fA05><s2>4209</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG"><s1>Word-based correction for retrieval of arabic OCR degraded documents</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>String processing and information retrieval : 13th International conference, SPIRE 2006, Glasgow, UK, October 11-13, 2006 : proceedings</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>MAGDY (Walid)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>DARWISH (Kareem)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>CRESTANI (Fabio)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>FERRAGINA (Paolo)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="03" i2="1"><s1>SANDERSON (Mark)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA20><s1>205-216</s1>
</fA20>
<fA21><s1>2006</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA26 i1="01"><s0>3-540-45774-7</s0>
</fA26>
<fA43 i1="01"><s1>INIST</s1>
<s2>16343</s2>
<s5>354000153609370170</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2007 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>28 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>07-0453873</s0>
</fA47>
<fA60><s1>P</s1>
<s2>C</s2>
</fA60>
<fA64 i1="01" i2="1"><s0>Lecture notes in computer science</s0>
</fA64>
<fA66 i1="01"><s0>DEU</s0>
</fA66>
<fA66 i1="02"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001D02B07B</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Recherche information</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Information retrieval</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Búsqueda información</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Chaîne caractère</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Character string</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Cadena carácter</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0>
<s5>07</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Informatique diffuse</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Pervasive computing</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Informática difusa</s0>
<s5>08</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Indexation</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Indexing</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Indización</s0>
<s5>09</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Arabe</s0>
<s5>18</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Arabic</s0>
<s5>18</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Árabe</s0>
<s5>18</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE"><s0>Canal avec bruit</s0>
<s5>19</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Noisy channel</s0>
<s5>19</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA"><s0>Canal con ruido</s0>
<s5>19</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE"><s0>Correction erreur</s0>
<s5>23</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG"><s0>Error correction</s0>
<s5>23</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA"><s0>Corrección error</s0>
<s5>23</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE"><s0>.</s0>
<s4>INC</s4>
<s5>82</s5>
</fC03>
<fN21><s1>295</s1>
</fN21>
<fN44 i1="01"><s1>OTO</s1>
</fN44>
<fN82><s1>OTO</s1>
</fN82>
</pA>
<pR><fA30 i1="01" i2="1" l="ENG"><s1>International Conference on String Processing and Information Retrieval</s1>
<s2>13</s2>
<s3>Glasgow GBR</s3>
<s4>2006</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000461 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000461 | SxmlIndent | more
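Once the record has been extracted, its INIST fields can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `SAMPLE` string is an abridged copy of the `<inist>` part of the record, and the helper names (`authors`, `descriptors_by_language`) are hypothetical. It deliberately skips the TEI header, whose `wicri:` and `inist:` prefixes appear in this record without visible namespace declarations and may be rejected by a strict parser.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abridged copy of the <inist> element of the record (no namespace prefixes).
SAMPLE = """<inist><standard h6="B"><pA>
<fA11 i1="01" i2="1"><s1>MAGDY (Walid)</s1></fA11>
<fA11 i1="02" i2="1"><s1>DARWISH (Kareem)</s1></fA11>
<fC03 i1="01" i2="X" l="FRE"><s0>Recherche information</s0><s5>01</s5></fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Information retrieval</s0><s5>01</s5></fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Noisy channel</s0><s5>19</s5></fC03>
</pA></standard></inist>"""


def authors(root: ET.Element) -> list[str]:
    """Collect author names from the fA11 fields (subfield s1)."""
    return [field.findtext("s1", default="") for field in root.iter("fA11")]


def descriptors_by_language(root: ET.Element) -> dict[str, list[str]]:
    """Group fC03 descriptor terms (subfield s0) by their language attribute l."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for field in root.iter("fC03"):
        lang = field.get("l")
        term = field.findtext("s0")
        if lang and term:
            grouped[lang].append(term)
    return dict(grouped)


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(authors(root))
    print(descriptors_by_language(root))
```

The same helpers could be pointed at the output of the commands above, provided the TEI part is removed or the missing namespace declarations are supplied first.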
To add a link to this page in the Wicri network
{{Explor lien
|wiki= Ticri/CIDE
|area= OcrV1
|flux= PascalFrancis
|étape= Curation
|type= RBID
|clé= Pascal:07-0453873
|texte= Word-based correction for retrieval of arabic OCR degraded documents
}}
This area was generated with Dilib version V0.6.32. Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024