A search engine for Arabic documents
Internal identifier: 000120 (Hal/Checkpoint); previous: 000119; next: 000121
Authors: T. Sari [Algeria]; A. Kefali [Algeria]
French descriptors
- Wicri:
- topic: Recherche documentaire.
English descriptors
- mix: Arabic handwriting recognition; Document retrieval; handwriting segmentation.
Abstract
This paper attempts to index and search degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.
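The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that each connected component has already been encoded as a short ASCII string of feature codes. The alphabet used here (`L` for loop, `A` for ascender, `D` for descender, `P` for dot), the function names, and the distance threshold are hypothetical choices for illustration and are not taken from the paper. The sketch computes the standard Levenshtein distance and keeps the codes that lie within a tolerance of the query.

```python
def levenshtein(a, b):
    """Standard edit distance: insertions, deletions, substitutions cost 1."""
    # Keep the shorter string in the inner loop to limit row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def search(query_code, document_codes, max_distance):
    """Return (position, code, distance) for codes close to the query.

    document_codes is assumed to be the list of per-component feature
    strings read from a document's ASCII index file (hypothetical layout).
    """
    hits = []
    for position, code in enumerate(document_codes):
        distance = levenshtein(query_code, code)
        if distance <= max_distance:
            hits.append((position, code, distance))
    # Best matches first; ties keep document order.
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Hypothetical index: one feature string per connected component.
    index = ["LAD", "LD", "PPP", "LADD"]
    for hit in search("LAD", index, max_distance=1):
        print(hit)
```

Allowing a non-zero `max_distance` is what lets the search tolerate a missed or spurious feature in a degraded component, at the cost of more false positives as the threshold grows.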
Hal:hal-00334402
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A search engine for Arabic documents</title>
<author><name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00334402</idno>
<idno type="halId">hal-00334402</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<date when="2008-10">2008-10</date>
<idno type="wicri:Area/Hal/Corpus">000014</idno>
<idno type="wicri:Area/Hal/Curation">000014</idno>
<idno type="wicri:Area/Hal/Checkpoint">000120</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">A search engine for Arabic documents</title>
<author><name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="mix" xml:lang="en"><term>Arabic handwriting recognition</term>
<term>Document retrieval</term>
<term>handwriting segmentation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper attempts to index and search degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.</div>
</front>
</TEI>
<hal api="V3"><titleStmt><title xml:lang="en">A search engine for Arabic documents</title>
<author role="aut"><persName><forename type="first">T.</forename>
<surname>Sari</surname>
</persName>
<email></email>
<idno type="halauthor">363399</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
<author role="aut"><persName><forename type="first">A.</forename>
<surname>Kefali</surname>
</persName>
<email></email>
<idno type="halauthor">363400</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
<editor role="depositor"><persName><forename>Sébastien</forename>
<surname>Adam</surname>
</persName>
<email>Sebastien.Adam@univ-rouen.fr</email>
</editor>
</titleStmt>
<editionStmt><edition n="v1" type="current"><date type="whenSubmitted">2008-10-26 01:33:44</date>
<date type="whenModified">2008-10-28 16:32:13</date>
<date type="whenReleased">2008-10-28 16:32:13</date>
<date type="whenProduced">2008-10</date>
<date type="whenEndEmbargoed">2008-10-26</date>
<ref type="file" target="https://hal.archives-ouvertes.fr/hal-00334402/document"><date notBefore="2008-10-26"></date>
</ref>
<ref type="file" subtype="publisherAgreement" n="1" target="https://hal.archives-ouvertes.fr/hal-00334402/file/paper-21.pdf"><date notBefore="2008-10-26"></date>
</ref>
</edition>
<respStmt><resp>contributor</resp>
<name key="131493"><persName><forename>Sébastien</forename>
<surname>Adam</surname>
</persName>
<email>Sebastien.Adam@univ-rouen.fr</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt><distributor>CCSD</distributor>
<idno type="halId">hal-00334402</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<idno type="halBibtex">sari:hal-00334402</idno>
<idno type="halRefHtml">Antoine Tabbone et Thierry Paquet. Colloque International Francophone sur l'Ecrit et le Document, Oct 2008, France. Groupe de Recherche en Communication Ecrite, pp.97-102, 2008</idno>
<idno type="halRef">Antoine Tabbone et Thierry Paquet. Colloque International Francophone sur l'Ecrit et le Document, Oct 2008, France. Groupe de Recherche en Communication Ecrite, pp.97-102, 2008</idno>
</publicationStmt>
<seriesStmt><idno type="stamp" n="CIFED08">Colloque International Francophone sur l'Ecrit et le Document (CIFED'08)</idno>
</seriesStmt>
<notesStmt><note type="audience" n="2">International</note>
<note type="invited" n="0">No</note>
<note type="popular" n="0">No</note>
<note type="peer" n="1">Yes</note>
<note type="proceedings" n="1">Yes</note>
</notesStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">A search engine for Arabic documents</title>
<author role="aut"><persName><forename type="first">T.</forename>
<surname>Sari</surname>
</persName>
<idno type="halAuthorId">363399</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
<author role="aut"><persName><forename type="first">A.</forename>
<surname>Kefali</surname>
</persName>
<idno type="halAuthorId">363400</idno>
<affiliation ref="#struct-81739"></affiliation>
</author>
</analytic>
<monogr><title level="m">Dixième Colloque International Francophone sur l'Ecrit et le Document</title>
<meeting><title>Colloque International Francophone sur l'Ecrit et le Document</title>
<date type="start">2008-10</date>
<country key="FR">France</country>
</meeting>
<editor>Antoine Tabbone et Thierry Paquet</editor>
<imprint><publisher>Groupe de Recherche en Communication Ecrite</publisher>
<biblScope unit="pp">97-102</biblScope>
<date type="datePub">2008</date>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
<profileDesc><langUsage><language ident="en">English</language>
</langUsage>
<textClass><keywords scheme="author"><term xml:lang="en">handwriting segmentation</term>
<term xml:lang="en">Document retrieval</term>
<term xml:lang="en">Arabic handwriting recognition</term>
</keywords>
<classCode scheme="halDomain" n="info.info-tt">Computer Science [cs]/Document and Text Processing</classCode>
<classCode scheme="halDomain" n="info.info-cv">Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV]</classCode>
<classCode scheme="halDomain" n="info.info-ts">Computer Science [cs]/Signal and Image Processing</classCode>
<classCode scheme="halDomain" n="spi.signal">Engineering Sciences [physics]/Signal and Image processing</classCode>
<classCode scheme="halTypology" n="COMM">Conference papers</classCode>
</textClass>
<abstract xml:lang="en">This paper attempts to index and search degraded document images without recognizing the textual patterns, thereby circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices modelling handwriting or degraded prints, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.</abstract>
</profileDesc>
</hal>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Hal/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000120 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Hal/Checkpoint/biblio.hfd -nk 000120 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Hal |étape= Checkpoint |type= RBID |clé= Hal:hal-00334402 |texte= A search engine for Arabic documents }}
This area was generated with Dilib version V0.6.32.