Serveur d'exploration sur l'OCR

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Recognition of Table of Contents for Electronic Library Consulting

Identifieur interne : 000101 ( Hal/Corpus ); précédent : 000100; suivant : 000102

Recognition of Table of Contents for Electronic Library Consulting

Auteurs : Abdel Belaïd

Source :

RBID : Hal:inria-00100452

Descripteurs français

Abstract

A labeling approach for automatic recognition of Tables of Contents (ToC) is described in this paper. A prototype is used for electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labeling without using any a priori model. Labeling is based on a Part of Speech Tagging (PoS) which is initiated by a primary labeling of text component using some specific dictionaries. Significant tags are first grouped in homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: ``title'' and ``authors''. Non labeled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well detected articles. The designed prototype operates with a great satisfaction on different ToC layouts and character recognition qualities. Without manual intervention, 96.3\% rate of correct segmentation was obtained on 38 journals including 2020 articles and 93.0\% rate of correct field extraction.

Url:

Links to Exploration step

Hal:inria-00100452

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Recognition of Table of Contents for Electronic Library Consulting</title>
<author>
<name sortKey="Belaid, Abdel" sort="Belaid, Abdel" uniqKey="Belaid A" first="Abdel" last="Belaïd">Abdel Belaïd</name>
<affiliation>
<hal:affiliation type="researchteam" xml:id="struct-2362" status="OLD">
<orgName>READ</orgName>
<orgName type="acronym">READ</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00100452</idno>
<idno type="halId">inria-00100452</idno>
<idno type="halUri">https://hal.inria.fr/inria-00100452</idno>
<idno type="url">https://hal.inria.fr/inria-00100452</idno>
<date when="2001">2001</date>
<idno type="wicri:Area/Hal/Corpus">000101</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Recognition of Table of Contents for Electronic Library Consulting</title>
<author>
<name sortKey="Belaid, Abdel" sort="Belaid, Abdel" uniqKey="Belaid A" first="Abdel" last="Belaïd">Abdel Belaïd</name>
<affiliation>
<hal:affiliation type="researchteam" xml:id="struct-2362" status="OLD">
<orgName>READ</orgName>
<orgName type="acronym">READ</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-160" type="direct"></relation>
<relation name="UMR7503" active="#struct-441569" type="indirect"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-300291" type="indirect"></relation>
<relation active="#struct-300292" type="indirect"></relation>
<relation active="#struct-300293" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-160" type="direct">
<org type="laboratory" xml:id="struct-160" status="OLD">
<orgName>Laboratoire Lorrain de Recherche en Informatique et ses Applications</orgName>
<orgName type="acronym">LORIA</orgName>
<desc>
<address>
<addrLine>Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.loria.fr</ref>
</desc>
<listRelation>
<relation name="UMR7503" active="#struct-441569" type="direct"></relation>
<relation active="#struct-300009" type="direct"></relation>
<relation active="#struct-300291" type="direct"></relation>
<relation active="#struct-300292" type="direct"></relation>
<relation active="#struct-300293" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR7503" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de VoluceauRocquencourt - BP 10578153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300291" type="indirect">
<org type="institution" xml:id="struct-300291" status="OLD">
<orgName>Université Henri Poincaré - Nancy 1</orgName>
<orgName type="acronym">UHP</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>24-30 rue Lionnois, BP 60120, 54 003 NANCY cedex, France</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300292" type="indirect">
<org type="institution" xml:id="struct-300292" status="OLD">
<orgName>Université Nancy 2</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<addrLine>91 avenue de la Libération, BP 454, 54001 Nancy cedex</addrLine>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300293" type="indirect">
<org type="institution" xml:id="struct-300293" status="OLD">
<orgName>Institut National Polytechnique de Lorraine</orgName>
<orgName type="acronym">INPL</orgName>
<date type="end">2011-12-31</date>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
</affiliation>
</author>
</analytic>
<series>
<title level="j">International Journal on Document Analysis and Recognition</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date type="datePub">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="fr">
<term>calliope project</term>
<term>digital library</term>
<term>projet calliope</term>
<term>reconnaissance de documents</term>
<term>table de matières</term>
<term>table of contents recognition -- part-of-speech tagging -- ocr combination</term>
<term>étiquetage par partie de discours</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">A labeling approach for automatic recognition of Tables of Contents (ToC) is described in this paper. A prototype is used for electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labeling without using any a priori model. Labeling is based on a Part of Speech Tagging (PoS) which is initiated by a primary labeling of text component using some specific dictionaries. Significant tags are first grouped in homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: ``title'' and ``authors''. Non labeled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well detected articles. The designed prototype operates with a great satisfaction on different ToC layouts and character recognition qualities. Without manual intervention, 96.3\% rate of correct segmentation was obtained on 38 journals including 2020 articles and 93.0\% rate of correct field extraction.</div>
</front>
</TEI>
<hal api="V3">
<titleStmt>
<title xml:lang="en">Recognition of Table of Contents for Electronic Library Consulting</title>
<author role="aut">
<persName>
<forename type="first">Abdel</forename>
<surname>Belaïd</surname>
</persName>
<email></email>
<idno type="halauthor">129619</idno>
<orgName ref="#struct-441569"></orgName>
<affiliation ref="#struct-2362"></affiliation>
</author>
<editor role="depositor">
<persName>
<forename>Publications</forename>
<surname>Loria</surname>
</persName>
<email>publications@loria.fr</email>
</editor>
</titleStmt>
<editionStmt>
<edition n="v1" type="current">
<date type="whenSubmitted">2006-09-26 14:45:53</date>
<date type="whenModified">2016-05-19 01:05:20</date>
<date type="whenReleased">2006-09-28 15:22:47</date>
<date type="whenProduced">2001</date>
</edition>
<respStmt>
<resp>contributor</resp>
<name key="108626">
<persName>
<forename>Publications</forename>
<surname>Loria</surname>
</persName>
<email>publications@loria.fr</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt>
<distributor>CCSD</distributor>
<idno type="halId">inria-00100452</idno>
<idno type="halUri">https://hal.inria.fr/inria-00100452</idno>
<idno type="halBibtex">belaid:inria-00100452</idno>
<idno type="halRefHtml">International Journal on Document Analysis and Recognition, Springer Verlag, 2001, 4 (1), pp.35-45</idno>
<idno type="halRef">International Journal on Document Analysis and Recognition, Springer Verlag, 2001, 4 (1), pp.35-45</idno>
</publicationStmt>
<seriesStmt>
<idno type="stamp" n="INRIA">INRIA - Institut National de Recherche en Informatique et en Automatique</idno>
<idno type="stamp" n="CNRS">CNRS - Centre national de la recherche scientifique</idno>
<idno type="stamp" n="LORIA2">Publications du LORIA</idno>
<idno type="stamp" n="LABO-LORIA-SET" p="LORIA">LABO-LORIA-SET</idno>
<idno type="stamp" n="LORIA">LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications</idno>
<idno type="stamp" n="LORIA-TALC" p="LORIA">Traitement automatique des langues et des connaissances</idno>
<idno type="stamp" n="UNIV-LORRAINE">Université de Lorraine</idno>
<idno type="stamp" n="INPL">Institut National Polytechnique de Lorraine</idno>
</seriesStmt>
<notesStmt>
<note type="commentary">Article dans revue scientifique avec comité de lecture.</note>
<note type="audience" n="1">Not set</note>
<note type="popular" n="0">No</note>
<note type="peer" n="1">Yes</note>
</notesStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Recognition of Table of Contents for Electronic Library Consulting</title>
<author role="aut">
<persName>
<forename type="first">Abdel</forename>
<surname>Belaïd</surname>
</persName>
<idno type="halAuthorId">129619</idno>
<orgName ref="#struct-441569"></orgName>
<affiliation ref="#struct-2362"></affiliation>
</author>
</analytic>
<monogr>
<idno type="localRef">A01-R-109 || belaïd01a</idno>
<idno type="halJournalId" status="VALID">12645</idno>
<idno type="issn">1433-2833</idno>
<idno type="eissn">1433-2825</idno>
<title level="j">International Journal on Document Analysis and Recognition</title>
<imprint>
<publisher>Springer Verlag</publisher>
<biblScope unit="volume">4</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="pp">35-45</biblScope>
<date type="datePub">2001</date>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
<profileDesc>
<langUsage>
<language ident="en">English</language>
</langUsage>
<textClass>
<keywords scheme="author">
<term xml:lang="fr">table of contents recognition -- part-of-speech tagging -- ocr combination</term>
<term xml:lang="fr">calliope project</term>
<term xml:lang="fr">digital library</term>
<term xml:lang="fr">table de matières</term>
<term xml:lang="fr">reconnaissance de documents</term>
<term xml:lang="fr">étiquetage par partie de discours</term>
<term xml:lang="fr">projet calliope</term>
</keywords>
<classCode scheme="halDomain" n="info.info-oh">Computer Science [cs]/Other [cs.OH]</classCode>
<classCode scheme="halTypology" n="ART">Journal articles</classCode>
</textClass>
<abstract xml:lang="en">A labeling approach for automatic recognition of Tables of Contents (ToC) is described in this paper. A prototype is used for electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labeling without using any a priori model. Labeling is based on a Part of Speech Tagging (PoS) which is initiated by a primary labeling of text component using some specific dictionaries. Significant tags are first grouped in homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: ``title'' and ``authors''. Non labeled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well detected articles. The designed prototype operates with a great satisfaction on different ToC layouts and character recognition qualities. Without manual intervention, 96.3\% rate of correct segmentation was obtained on 38 journals including 2020 articles and 93.0\% rate of correct field extraction.</abstract>
</profileDesc>
</hal>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Hal/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000101 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Hal/Corpus/biblio.hfd -nk 000101 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Hal
   |étape=   Corpus
   |type=    RBID
   |clé=     Hal:inria-00100452
   |texte=   Recognition of Table of Contents for Electronic Library Consulting
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024