OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Word-based correction for retrieval of arabic OCR degraded documents

Internal identifier: 000461 (PascalFrancis/Curation); previous: 000460; next: 000462

Authors: Walid Magdy [Egypt]; Kareem Darwish [Egypt]

RBID : Pascal:07-0453873

Abstract

Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
pA  
A01 01  1    @0 0302-9743
A05       @2 4209
A08 01  1  ENG  @1 Word-based correction for retrieval of arabic OCR degraded documents
A09 01  1  ENG  @1 String processing and information retrieval : 13th International conference, SPIRE 2006, Glasgow, UK, October 11-13, 2006 : proceedings
A11 01  1    @1 MAGDY (Walid)
A11 02  1    @1 DARWISH (Kareem)
A12 01  1    @1 CRESTANI (Fabio) @9 ed.
A12 02  1    @1 FERRAGINA (Paolo) @9 ed.
A12 03  1    @1 SANDERSON (Mark) @9 ed.
A14 01      @1 IBM Technology Development Center P.O. Box 166 El-Ahram @2 Giza @3 EGY @Z 1 aut. @Z 2 aut.
A20       @1 205-216
A21       @1 2006
A23 01      @0 ENG
A26 01      @0 3-540-45774-7
A43 01      @1 INIST @2 16343 @5 354000153609370170
A44       @0 0000 @1 © 2007 INIST-CNRS. All rights reserved.
A45       @0 28 ref.
A47 01  1    @0 07-0453873
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 Lecture notes in computer science
A66 01      @0 DEU
A66 02      @0 USA
C01 01    ENG  @0 Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
C02 01  X    @0 001D02B07B
C03 01  X  FRE  @0 Recherche information @5 01
C03 01  X  ENG  @0 Information retrieval @5 01
C03 01  X  SPA  @0 Búsqueda información @5 01
C03 02  X  FRE  @0 Chaîne caractère @5 02
C03 02  X  ENG  @0 Character string @5 02
C03 02  X  SPA  @0 Cadena carácter @5 02
C03 03  X  FRE  @0 Reconnaissance caractère @5 06
C03 03  X  ENG  @0 Character recognition @5 06
C03 03  X  SPA  @0 Reconocimiento carácter @5 06
C03 04  X  FRE  @0 Reconnaissance optique caractère @5 07
C03 04  X  ENG  @0 Optical character recognition @5 07
C03 04  X  SPA  @0 Reconocimiento óptico de caracteres @5 07
C03 05  X  FRE  @0 Informatique diffuse @5 08
C03 05  X  ENG  @0 Pervasive computing @5 08
C03 05  X  SPA  @0 Informática difusa @5 08
C03 06  X  FRE  @0 Indexation @5 09
C03 06  X  ENG  @0 Indexing @5 09
C03 06  X  SPA  @0 Indización @5 09
C03 07  X  FRE  @0 Arabe @5 18
C03 07  X  ENG  @0 Arabic @5 18
C03 07  X  SPA  @0 Árabe @5 18
C03 08  X  FRE  @0 Canal avec bruit @5 19
C03 08  X  ENG  @0 Noisy channel @5 19
C03 08  X  SPA  @0 Canal con ruido @5 19
C03 09  X  FRE  @0 Correction erreur @5 23
C03 09  X  ENG  @0 Error correction @5 23
C03 09  X  SPA  @0 Corrección error @5 23
C03 10  X  FRE  @0 . @4 INC @5 82
N21       @1 295
N44 01      @1 OTO
N82       @1 OTO
pR  
A30 01  1  ENG  @1 International Conference on String Processing and Information Retrieval @2 13 @3 Glasgow GBR @4 2006
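
The flattened field lines above follow a visible pattern: a field tag, optional indicators and a language code, then one or more subfields introduced by `@` and a one-character code. The Python sketch below splits one such line into its parts. It assumes the layout inferred from the lines shown on this page, not an official INIST specification, and `parse_field_line` is a hypothetical helper name.

```python
import re

# Assumed layout (inferred from the lines on this page):
#   <tag> [indicators / language code ...] @<code> <value> [@<code> <value> ...]
# A subfield value runs until the next " @<code> " marker or the end of the line.
SUBFIELD = re.compile(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)")


def parse_field_line(line):
    """Split one flattened field line into (tag, qualifiers, subfields)."""
    head, marker, tail = line.partition("@")
    parts = head.split()
    if not parts:
        raise ValueError("empty field line")
    tag, qualifiers = parts[0], parts[1:]
    # Each subfield is a (code, value) pair; repeated codes are kept in order.
    subfields = [(code, value.strip())
                 for code, value in SUBFIELD.findall(marker + tail)]
    return tag, qualifiers, subfields
```

Applied to the `A14` line, this should yield the tag `A14`, the qualifier `01`, and five subfields, with the two `@Z` entries kept as separate pairs rather than merged into a dictionary.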

Links to Exploration step

Pascal:07-0453873

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Word-based correction for retrieval of arabic OCR degraded documents</title>
<author>
<name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
<author>
<name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">07-0453873</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 07-0453873 INIST</idno>
<idno type="RBID">Pascal:07-0453873</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000325</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000461</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Word-based correction for retrieval of arabic OCR degraded documents</title>
<author>
<name sortKey="Magdy, Walid" sort="Magdy, Walid" uniqKey="Magdy W" first="Walid" last="Magdy">Walid Magdy</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
<author>
<name sortKey="Darwish, Kareem" sort="Darwish, Kareem" uniqKey="Darwish K" first="Kareem" last="Darwish">Kareem Darwish</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
<imprint>
<date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Arabic</term>
<term>Character recognition</term>
<term>Character string</term>
<term>Error correction</term>
<term>Indexing</term>
<term>Information retrieval</term>
<term>Noisy channel</term>
<term>Optical character recognition</term>
<term>Pervasive computing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Recherche information</term>
<term>Chaîne caractère</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Informatique diffuse</term>
<term>Indexation</term>
<term>Arabe</term>
<term>Canal avec bruit</term>
<term>Correction erreur</term>
<term>.</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0302-9743</s0>
</fA01>
<fA05>
<s2>4209</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>Word-based correction for retrieval of arabic OCR degraded documents</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>String processing and information retrieval : 13th International conference, SPIRE 2006, Glasgow, UK, October 11-13, 2006 : proceedings</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>MAGDY (Walid)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>DARWISH (Kareem)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>CRESTANI (Fabio)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>FERRAGINA (Paolo)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="03" i2="1">
<s1>SANDERSON (Mark)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>IBM Technology Development Center P.O. Box 166 El-Ahram</s1>
<s2>Giza</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA20>
<s1>205-216</s1>
</fA20>
<fA21>
<s1>2006</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA26 i1="01">
<s0>3-540-45774-7</s0>
</fA26>
<fA43 i1="01">
<s1>INIST</s1>
<s2>16343</s2>
<s5>354000153609370170</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2007 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>28 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>07-0453873</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Lecture notes in computer science</s0>
</fA64>
<fA66 i1="01">
<s0>DEU</s0>
</fA66>
<fA66 i1="02">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001D02B07B</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Recherche information</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Information retrieval</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Búsqueda información</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Chaîne caractère</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Character string</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Cadena carácter</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Reconnaissance caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Reconocimiento carácter</s0>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Reconocimiento óptico de caracteres</s0>
<s5>07</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Informatique diffuse</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Pervasive computing</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Informática difusa</s0>
<s5>08</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE">
<s0>Indexation</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG">
<s0>Indexing</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA">
<s0>Indización</s0>
<s5>09</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE">
<s0>Arabe</s0>
<s5>18</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Arabic</s0>
<s5>18</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA">
<s0>Árabe</s0>
<s5>18</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE">
<s0>Canal avec bruit</s0>
<s5>19</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG">
<s0>Noisy channel</s0>
<s5>19</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA">
<s0>Canal con ruido</s0>
<s5>19</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE">
<s0>Correction erreur</s0>
<s5>23</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG">
<s0>Error correction</s0>
<s5>23</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA">
<s0>Corrección error</s0>
<s5>23</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE">
<s0>.</s0>
<s4>INC</s4>
<s5>82</s5>
</fC03>
<fN21>
<s1>295</s1>
</fN21>
<fN44 i1="01">
<s1>OTO</s1>
</fN44>
<fN82>
<s1>OTO</s1>
</fN82>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>International Conference on String Processing and Information Retrieval</s1>
<s2>13</s2>
<s3>Glasgow GBR</s3>
<s4>2006</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
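
The record above uses the prefixes `wicri:` and `inist:` without declaring them, so a strict XML parser will likely reject it as it stands. The Python sketch below shows one way to read it with the standard library: wrap the text in an element that binds the prefixes, then query it. The namespace URIs are placeholders chosen for illustration, and `load_record` and `summarize` are hypothetical helper names, not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumption): the record as displayed declares none.
NS = {"wicri": "urn:example:wicri", "inist": "urn:example:inist"}


def load_record(xml_text):
    """Parse a <record> fragment by wrapping it in an element that binds the prefixes."""
    decls = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
    wrapper = ET.fromstring(f"<wrapper {decls}>{xml_text}</wrapper>")
    return wrapper.find("record")


def summarize(record):
    """Collect a few bibliographic fields from the TEI header."""
    return {
        "title": record.findtext(".//titleStmt/title"),
        "authors": [n.text for n in record.findall(".//titleStmt/author/name")],
        "keywords": [t.text for t in
                     record.findall(".//keywords[@scheme='KwdEn']/term")],
        # Prefixed elements need the namespace map when queried.
        "city": record.findtext(
            ".//titleStmt/author/affiliation/inist:fA14/s2", namespaces=NS),
    }
```

The wrapper approach leaves the record text untouched. It assumes the fragment carries no XML declaration of its own, which holds for the record shown here.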

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000461 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000461 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Curation
   |type=    RBID
   |clé=     Pascal:07-0453873
   |texte=   Word-based correction for retrieval of arabic OCR degraded documents
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024