OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A search engine for Arabic documents

Internal identifier: 000120 (Hal/Checkpoint); previous: 000119; next: 000121

Authors: T. Sari [Algeria]; A. Kefali [Algeria]

RBID : Hal:hal-00334402

Abstract

This paper presents an approach to indexing and searching degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR. The proposed approach deals with text-dominant documents, either handwritten or printed. After preprocessing and segmentation, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can cope efficiently with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.
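
The approximate matching step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-letter feature codes (`L` for loop, `A` for ascender, `D` for descender) and the function names are assumptions made for illustration only. The sketch computes the Levenshtein distance and uses the standard dynamic-programming variant in which a match may start anywhere in the text, so that a query code string can be found inside a document's code file even when some features were missed or misdetected.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertion, deletion, substitution cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approximate_find(pattern: str, text: str, max_dist: int) -> list[tuple[int, int]]:
    """Return (end_position, distance) for each 1-based end position in text
    where some substring ending there is within max_dist edits of pattern."""
    # Row 0 is all zeros: an empty pattern matches at any position for free,
    # which lets a match start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, start=1):
        curr = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cp == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return [(j, d) for j, d in enumerate(prev) if j > 0 and d <= max_dist]


if __name__ == "__main__":
    # Hypothetical feature codes: L = loop, A = ascender, D = descender.
    document_codes = "AALDDL"   # the ascender of the queried word was missed
    query_codes = "LAD"
    print(approximate_find(query_codes, document_codes, max_dist=1))
```

Setting `max_dist` to 0 reduces the search to exact matching; raising it trades precision for tolerance to missing or spurious features.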

Links to Exploration step

Hal:hal-00334402

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A search engine for Arabic documents</title>
<author>
<name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00334402</idno>
<idno type="halId">hal-00334402</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<date when="2008-10">2008-10</date>
<idno type="wicri:Area/Hal/Corpus">000014</idno>
<idno type="wicri:Area/Hal/Curation">000014</idno>
<idno type="wicri:Area/Hal/Checkpoint">000120</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A search engine for Arabic documents</title>
<author>
<name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Arabic handwriting recognition</term>
<term>Document retrieval</term>
<term>handwriting segmentation</term>
<term>handwriting segmentation.</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper presents an approach to indexing and searching degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR. The proposed approach deals with text-dominant documents, either handwritten or printed. After preprocessing and segmentation, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can cope efficiently with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.</div>
</front>
</TEI>
<hal api="V3">
<titleStmt>
<title xml:lang="en">A search engine for Arabic documents</title>
<author role="aut">
<persName>
<forename type="first">T.</forename>
<surname>Sari</surname>
</persName>
<email></email>
<idno type="halauthor">363399</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
<author role="aut">
<persName>
<forename type="first">A.</forename>
<surname>Kefali</surname>
</persName>
<email></email>
<idno type="halauthor">363400</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
<editor role="depositor">
<persName>
<forename>Sébastien</forename>
<surname>Adam</surname>
</persName>
<email>Sebastien.Adam@univ-rouen.fr</email>
</editor>
</titleStmt>
<editionStmt>
<edition n="v1" type="current">
<date type="whenSubmitted">2008-10-26 01:33:44</date>
<date type="whenModified">2008-10-28 16:32:13</date>
<date type="whenReleased">2008-10-28 16:32:13</date>
<date type="whenProduced">2008-10</date>
<date type="whenEndEmbargoed">2008-10-26</date>
<ref type="file" target="https://hal.archives-ouvertes.fr/hal-00334402/document">
<date notBefore="2008-10-26"></date>
</ref>
<ref type="file" subtype="publisherAgreement" n="1" target="https://hal.archives-ouvertes.fr/hal-00334402/file/paper-21.pdf">
<date notBefore="2008-10-26"></date>
</ref>
</edition>
<respStmt>
<resp>contributor</resp>
<name key="131493">
<persName>
<forename>Sébastien</forename>
<surname>Adam</surname>
</persName>
<email>Sebastien.Adam@univ-rouen.fr</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt>
<distributor>CCSD</distributor>
<idno type="halId">hal-00334402</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<idno type="halBibtex">sari:hal-00334402</idno>
<idno type="halRefHtml">Antoine Tabbone et Thierry Paquet. Colloque International Francophone sur l'Ecrit et le Document, Oct 2008, France. Groupe de Recherche en Communication Ecrite, pp.97-102, 2008</idno>
<idno type="halRef">Antoine Tabbone et Thierry Paquet. Colloque International Francophone sur l'Ecrit et le Document, Oct 2008, France. Groupe de Recherche en Communication Ecrite, pp.97-102, 2008</idno>
</publicationStmt>
<seriesStmt>
<idno type="stamp" n="CIFED08">Colloque International Francophone sur l'Ecrit et le Document (CIFED'08)</idno>
</seriesStmt>
<notesStmt>
<note type="audience" n="2">International</note>
<note type="invited" n="0">No</note>
<note type="popular" n="0">No</note>
<note type="peer" n="1">Yes</note>
<note type="proceedings" n="1">Yes</note>
</notesStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A search engine for Arabic documents</title>
<author role="aut">
<persName>
<forename type="first">T.</forename>
<surname>Sari</surname>
</persName>
<idno type="halAuthorId">363399</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
<author role="aut">
<persName>
<forename type="first">A.</forename>
<surname>Kefali</surname>
</persName>
<idno type="halAuthorId">363400</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
</analytic>
<monogr>
<title level="m">Dixième Colloque International Francophone sur l'Ecrit et le Document</title>
<meeting>
<title>Colloque International Francophone sur l'Ecrit et le Document</title>
<date type="start">2008-10</date>
<country key="FR">France</country>
</meeting>
<editor>Antoine Tabbone et Thierry Paquet</editor>
<imprint>
<publisher>Groupe de Recherche en Communication Ecrite</publisher>
<biblScope unit="pp">97-102</biblScope>
<date type="datePub">2008</date>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
<profileDesc>
<langUsage>
<language ident="en">English</language>
</langUsage>
<textClass>
<keywords scheme="author">
<term xml:lang="en">handwriting segmentation</term>
<term xml:lang="en">Document retrieval</term>
<term xml:lang="en">Arabic handwriting recognition</term>
<term xml:lang="en">handwriting segmentation.</term>
</keywords>
<classCode scheme="halDomain" n="info.info-tt">Computer Science [cs]/Document and Text Processing</classCode>
<classCode scheme="halDomain" n="info.info-cv">Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV]</classCode>
<classCode scheme="halDomain" n="info.info-ts">Computer Science [cs]/Signal and Image Processing</classCode>
<classCode scheme="halDomain" n="spi.signal">Engineering Sciences [physics]/Signal and Image processing</classCode>
<classCode scheme="halTypology" n="COMM">Conference papers</classCode>
</textClass>
<abstract xml:lang="en">This paper presents an approach to indexing and searching degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR. The proposed approach deals with text-dominant documents, either handwritten or printed. After preprocessing and segmentation, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can cope efficiently with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.</abstract>
</profileDesc>
</hal>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Hal/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000120 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Hal/Checkpoint/biblio.hfd -nk 000120 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Hal
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Hal:hal-00334402
   |texte=   A search engine for Arabic documents
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024