Hybred: An OCR Document Representation for Classification Tasks
Internal identifier: 000097 (France/Analysis); previous: 000096; next: 000098
Authors: Sami Laroum [France]; Nicolas Béchet [France]; Hatem Hamza [France]; Mathieu Roche [France]
Source:
- International Journal of Computer Science Issues [ 1694-0784 ] ; 2011-05.
English descriptors
- mix: Data Mining, Information Retrieval, OCR, Text Mining
Abstract
The classification of digital documents is a complex task in a document analysis flow. The number of documents resulting from OCR (optical character recognition) retro-conversion makes the classification task harder. In the literature, different features are used to improve classification quality. In this paper, we evaluate various features on OCRed and non-OCRed documents. Based on this evaluation, we propose the HYBRED (HYBrid REpresentation of Documents) approach, which combines different features into a single relevant representation. Experiments conducted on real data demonstrate the value of this approach.
Url:
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 000058
- to stream Hal, to step Curation: 000058
- to stream Hal, to step Checkpoint: 000086
- to stream Main, to step Merge: 000349
- to stream Main, to step Curation: 000344
- to stream Main, to step Exploration: 000344
- to stream France, to step Extraction: 000097
Links to Exploration step
Hal:lirmm-00723581
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Hybred: An OCR Document Representation for Classification Tasks</title>
<author><name sortKey="Laroum, Sami" sort="Laroum, Sami" uniqKey="Laroum S" first="Sami" last="Laroum">Sami Laroum</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-181" status="VALID"><idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc><address><addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation><relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle name="UMR5506" active="#struct-410122" type="direct"><org type="institution" xml:id="struct-410122" status="VALID"><orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc><address><addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Bechet, Nicolas" sort="Bechet, Nicolas" uniqKey="Bechet N" first="Nicolas" last="Béchet">Nicolas Béchet</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-2446" status="OLD"><idno type="RNSR">200318386B</idno>
<orgName>Usage-centered design, analysis and improvement of information systems</orgName>
<orgName type="acronym">AxIS</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-34586" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-86790" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-34586" type="direct"><org type="laboratory" xml:id="struct-34586" status="VALID"><idno type="RNSR">198318250R</idno>
<orgName>Inria Sophia Antipolis - Méditerranée </orgName>
<orgName type="acronym">CRISAM</orgName>
<desc><address><addrLine>2004 route des Lucioles BP 93 06902 Sophia Antipolis</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/sophia/</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de Voluceau Rocquencourt - BP 105 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-86790" type="direct"><org type="laboratory" xml:id="struct-86790" status="VALID"><idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc><address><addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Hamza, Hatem" sort="Hamza, Hatem" uniqKey="Hamza H" first="Hatem" last="Hamza">Hatem Hamza</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-23810" status="VALID"><orgName>Itesoft R&amp;D</orgName>
<desc><address><addrLine>Parc d'Andron - Le Sequoïa 30470 Aimargues</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.itesoft.fr</ref>
</desc>
<listRelation><relation active="#struct-365824" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-365824" type="direct"><org type="institution" xml:id="struct-365824" status="INCOMING"><orgName>ITESOFT</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Roche, Mathieu" sort="Roche, Mathieu" uniqKey="Roche M" first="Mathieu" last="Roche">Mathieu Roche</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-392245" status="VALID"><orgName>Exploration et exploitation de données textuelles</orgName>
<orgName type="acronym">TEXTE</orgName>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr/recherche/equipes/texte</ref>
</desc>
<listRelation><relation active="#struct-181" type="direct"></relation>
<relation name="UMR5506" active="#struct-410122" type="indirect"></relation>
<relation name="UMR5506" active="#struct-441569" type="indirect"></relation>
</listRelation>
<tutelles><tutelle active="#struct-181" type="direct"><org type="laboratory" xml:id="struct-181" status="VALID"><idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc><address><addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation><relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-410122" type="indirect"><org type="institution" xml:id="struct-410122" status="VALID"><orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc><address><addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="indirect"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:lirmm-00723581</idno>
<idno type="halId">lirmm-00723581</idno>
<idno type="halUri">http://hal-lirmm.ccsd.cnrs.fr/lirmm-00723581</idno>
<idno type="url">http://hal-lirmm.ccsd.cnrs.fr/lirmm-00723581</idno>
<date when="2011-05">2011-05</date>
<idno type="wicri:Area/Hal/Corpus">000058</idno>
<idno type="wicri:Area/Hal/Curation">000058</idno>
<idno type="wicri:Area/Hal/Checkpoint">000086</idno>
<idno type="wicri:doubleKey">1694-0784:2011:Laroum S:hybred:an:ocr</idno>
<idno type="wicri:Area/Main/Merge">000349</idno>
<idno type="wicri:Area/Main/Curation">000344</idno>
<idno type="wicri:Area/Main/Exploration">000344</idno>
<idno type="wicri:Area/France/Extraction">000097</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Hybred: An OCR Document Representation for Classification Tasks</title>
<author><name sortKey="Laroum, Sami" sort="Laroum, Sami" uniqKey="Laroum S" first="Sami" last="Laroum">Sami Laroum</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-181" status="VALID"><idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc><address><addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation><relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles><tutelle name="UMR5506" active="#struct-410122" type="direct"><org type="institution" xml:id="struct-410122" status="VALID"><orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc><address><addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="direct"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Bechet, Nicolas" sort="Bechet, Nicolas" uniqKey="Bechet N" first="Nicolas" last="Béchet">Nicolas Béchet</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-2446" status="OLD"><idno type="RNSR">200318386B</idno>
<orgName>Usage-centered design, analysis and improvement of information systems</orgName>
<orgName type="acronym">AxIS</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
<listRelation><relation active="#struct-34586" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-86790" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-34586" type="direct"><org type="laboratory" xml:id="struct-34586" status="VALID"><idno type="RNSR">198318250R</idno>
<orgName>Inria Sophia Antipolis - Méditerranée </orgName>
<orgName type="acronym">CRISAM</orgName>
<desc><address><addrLine>2004 route des Lucioles BP 93 06902 Sophia Antipolis</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/sophia/</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect"><org type="institution" xml:id="struct-300009" status="VALID"><orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc><address><addrLine>Domaine de Voluceau Rocquencourt - BP 105 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-86790" type="direct"><org type="laboratory" xml:id="struct-86790" status="VALID"><idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc><address><addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation><relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Hamza, Hatem" sort="Hamza, Hatem" uniqKey="Hamza H" first="Hatem" last="Hamza">Hatem Hamza</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-23810" status="VALID"><orgName>Itesoft R&amp;D</orgName>
<desc><address><addrLine>Parc d'Andron - Le Sequoïa 30470 Aimargues</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.itesoft.fr</ref>
</desc>
<listRelation><relation active="#struct-365824" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-365824" type="direct"><org type="institution" xml:id="struct-365824" status="INCOMING"><orgName>ITESOFT</orgName>
<desc><address><country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Roche, Mathieu" sort="Roche, Mathieu" uniqKey="Roche M" first="Mathieu" last="Roche">Mathieu Roche</name>
<affiliation wicri:level="1"><hal:affiliation type="researchteam" xml:id="struct-392245" status="VALID"><orgName>Exploration et exploitation de données textuelles</orgName>
<orgName type="acronym">TEXTE</orgName>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr/recherche/equipes/texte</ref>
</desc>
<listRelation><relation active="#struct-181" type="direct"></relation>
<relation name="UMR5506" active="#struct-410122" type="indirect"></relation>
<relation name="UMR5506" active="#struct-441569" type="indirect"></relation>
</listRelation>
<tutelles><tutelle active="#struct-181" type="direct"><org type="laboratory" xml:id="struct-181" status="VALID"><idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc><address><addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation><relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-410122" type="indirect"><org type="institution" xml:id="struct-410122" status="VALID"><orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc><address><addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="indirect"><org type="institution" xml:id="struct-441569" status="VALID"><idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc><address><country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
<series><title level="j">International Journal of Computer Science Issues</title>
<idno type="ISSN">1694-0784</idno>
<imprint><date type="datePub">2011-05</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="mix" xml:lang="en"><term>Data Mining</term>
<term>Information Retrieval</term>
<term>OCR</term>
<term>Text Mining</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The classification of digital documents is a complex task in a document analysis flow. The number of documents resulting from OCR (optical character recognition) retro-conversion makes the classification task harder. In the literature, different features are used to improve classification quality. In this paper, we evaluate various features on OCRed and non-OCRed documents. Based on this evaluation, we propose the HYBRED (HYBrid REpresentation of Documents) approach, which combines different features into a single relevant representation. Experiments conducted on real data demonstrate the value of this approach.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
</list>
<tree><country name="France"><noRegion><name sortKey="Laroum, Sami" sort="Laroum, Sami" uniqKey="Laroum S" first="Sami" last="Laroum">Sami Laroum</name>
</noRegion>
<name sortKey="Bechet, Nicolas" sort="Bechet, Nicolas" uniqKey="Bechet N" first="Nicolas" last="Béchet">Nicolas Béchet</name>
<name sortKey="Hamza, Hatem" sort="Hamza, Hatem" uniqKey="Hamza H" first="Hatem" last="Hamza">Hatem Hamza</name>
<name sortKey="Roche, Mathieu" sort="Roche, Mathieu" uniqKey="Roche M" first="Mathieu" last="Roche">Mathieu Roche</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000097 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000097 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= France |étape= Analysis |type= RBID |clé= Hal:lirmm-00723581 |texte= Hybred: An OCR Document Representation for Classification Tasks }}
This area was generated with Dilib version V0.6.32.