Recognition of table of contents for electronic library consulting

Internal identifier: 000696 (PascalFrancis/Corpus); previous: 000695; next: 000697

Authors: A. Belaïd

Source :

International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2001.

RBID : Pascal:02-0010779

French descriptors

Pascal (Inist)
- Bibliothèque électronique, Segmentation image, Problème agencement, Forme canonique, Dictionnaire, Texte, Reconnaissance forme, Reconnaissance caractère, Reconnaissance automatique, Analyse contenu, Reconnaissance optique caractère.

English descriptors

KwdEn :
- Automatic recognition, Canonical form, Character recognition, Content analysis, Dictionaries, Electronic library, Image segmentation, Layout problem, Optical character recognition, Pattern recognition, Text.

Abstract

A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.
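
The labelling idea in the abstract (dictionary-based primary labelling, a correction rule for unlabelled tokens, then reduction to article fields) can be illustrated with a very small sketch. This is not the paper's implementation: the dictionary contents, the tag names, the single correction rule, the extra `page` field and the sample ToC entry are all made up for illustration, and real part-of-speech tagging is replaced here by simple lookups and shape tests.

```python
import re

# Hypothetical mini-dictionaries; the paper's real dictionaries are not given in the abstract.
SURNAMES = {"smith", "dupont"}
CONNECTORS = {"and"}
AUTHOR_TAGS = {"INITIAL", "NAME"}


def primary_label(token):
    """Give one token a coarse tag from dictionary lookup and simple shape cues."""
    bare = token.rstrip(",;")
    if re.fullmatch(r"[A-Z]\.", bare):
        return "INITIAL"
    if bare.lower() in SURNAMES:
        return "NAME"
    if bare.lower() in CONNECTORS:
        return "CONN"
    if bare.isdigit():
        return "PAGE"
    return "WORD"


def to_fields(line):
    """Label the tokens of one ToC entry and reduce them to title/authors/page fields."""
    tokens = line.split()
    tags = [primary_label(t) for t in tokens]
    # Illustrative correction rule: a connector flanked by author-like tokens
    # is absorbed into the authors field.
    for i in range(1, len(tags) - 1):
        if tags[i] == "CONN" and tags[i - 1] in AUTHOR_TAGS and tags[i + 1] in AUTHOR_TAGS:
            tags[i] = "NAME"
    fields = {"title": [], "authors": [], "page": []}
    for token, tag in zip(tokens, tags):
        if tag in AUTHOR_TAGS:
            fields["authors"].append(token)
        elif tag == "PAGE":
            fields["page"].append(token)
        else:
            fields["title"].append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}


# Made-up ToC entry, used only to exercise the sketch.
entry = to_fields("Layout analysis of scanned journals J. Smith and M. Dupont 12")
print(entry)
```

Any token that no dictionary recognises falls back to the title field, which is a deliberately crude stand-in for the structure model the abstract mentions.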

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

A01	`01`	`1`		`@0 1433-2833`
A03		`1`		`@0 Int. j. doc. anal. recognit. : (Print)`
A05				`@2 4`
A06				`@2 1`
A08	`01`	`1`	`ENG`	`@1 Recognition of table of contents for electronic library consulting`
A09	`01`	`1`	`ENG`	`@1 Special Issue on Document Analysis for Office Systems (Part II)`
A11	`01`	`1`		`@1 BELAÏD (A.)`
A12	`01`	`1`		`@1 DENGEL (Andreas) @9 ed.`
A12	`02`	`1`		`@1 JUNKER (Markus) @9 ed.`
A14	`01`			`@1 LORIA-CNRS Campus Scientifique, B.P. 239 @2 54506 Vandoeuvre-lès-Nancy @3 FRA @Z 1 aut.`
A20				`@1 35-45`
A21				`@1 2001`
A23	`01`			`@0 ENG`
A43	`01`			`@1 INIST @2 26790 @5 354000099577480040`
A44				`@0 0000 @1 © 2002 INIST-CNRS. All rights reserved.`
A45				`@0 18 ref.`
A47	`01`	`1`		`@0 02-0010779`
A60				`@1 P`
A61				`@0 A`
A64	`01`	`1`		`@0 International journal on document analysis and recognition : (Print)`
A66	`01`			`@0 DEU`
C01	`01`		`ENG`	@0 A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.
C02	`01`	`X`		`@0 001D02C03`
C03	`01`	`X`	`FRE`	`@0 Bibliothèque électronique @5 01`
C03	`01`	`X`	`ENG`	`@0 Electronic library @5 01`
C03	`01`	`X`	`SPA`	`@0 Biblioteca electronica @5 01`
C03	`02`	`1`	`FRE`	`@0 Segmentation image @5 02`
C03	`02`	`1`	`ENG`	`@0 Image segmentation @5 02`
C03	`03`	`X`	`FRE`	`@0 Problème agencement @5 03`
C03	`03`	`X`	`ENG`	`@0 Layout problem @5 03`
C03	`03`	`X`	`SPA`	`@0 Problema disposición @5 03`
C03	`04`	`X`	`FRE`	`@0 Forme canonique @5 04`
C03	`04`	`X`	`ENG`	`@0 Canonical form @5 04`
C03	`04`	`X`	`SPA`	`@0 Forma canónica @5 04`
C03	`05`	`X`	`FRE`	`@0 Dictionnaire @5 05`
C03	`05`	`X`	`ENG`	`@0 Dictionaries @5 05`
C03	`05`	`X`	`SPA`	`@0 Diccionario @5 05`
C03	`06`	`X`	`FRE`	`@0 Texte @5 06`
C03	`06`	`X`	`ENG`	`@0 Text @5 06`
C03	`06`	`X`	`SPA`	`@0 Texto @5 06`
C03	`07`	`X`	`FRE`	`@0 Reconnaissance forme @5 07`
C03	`07`	`X`	`ENG`	`@0 Pattern recognition @5 07`
C03	`07`	`X`	`SPA`	`@0 Reconocimiento patrón @5 07`
C03	`08`	`X`	`FRE`	`@0 Reconnaissance caractère @5 08`
C03	`08`	`X`	`ENG`	`@0 Character recognition @5 08`
C03	`08`	`X`	`SPA`	`@0 Reconocimiento carácter @5 08`
C03	`09`	`X`	`FRE`	`@0 Reconnaissance automatique @5 09`
C03	`09`	`X`	`ENG`	`@0 Automatic recognition @5 09`
C03	`09`	`X`	`SPA`	`@0 Reconocimiento automático @5 09`
C03	`10`	`X`	`FRE`	`@0 Analyse contenu @5 10`
C03	`10`	`X`	`ENG`	`@0 Content analysis @5 10`
C03	`10`	`X`	`SPA`	`@0 Análisis contenido @5 10`
C03	`11`	`X`	`FRE`	`@0 Reconnaissance optique caractère @5 11`
C03	`11`	`X`	`ENG`	`@0 Optical character recognition @5 11`
C03	`11`	`X`	`SPA`	`@0 Reconocimento óptico de caracteres @5 11`
N21				`@1 001`
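
A listing like the one above can be read mechanically. The following is a minimal sketch, assuming each line is tab-separated into a field tag, two indicators, a language code and a value made of `@x` subfields; the column names `i1`, `i2` and `lang` are borrowed from the attributes of the XML version of this record, and the function name `parse_standard_line` is hypothetical.

```python
import re


def parse_standard_line(line):
    """Split one tab-separated line of the standard-format listing into its parts."""
    cells = [cell.strip("`") for cell in line.rstrip("\n").split("\t")]
    tag, value = cells[0], cells[-1]
    # Pad the middle cells so that i1, i2 and lang are always present.
    i1, i2, lang = (cells[1:-1] + ["", "", ""])[:3]
    # Subfields look like "@1 text @9 text"; capture (code, text) pairs.
    subfields = re.findall(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)", value)
    return {"tag": tag, "i1": i1, "i2": i2, "lang": lang, "subfields": subfields}


line = "A12\t`01`\t`1`\t\t`@1 DENGEL (Andreas) @9 ed.`"
print(parse_standard_line(line))
```

The regular expression treats an `@` followed by one character and a space as the start of a new subfield, so free text that happened to contain that pattern would be split incorrectly.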

Inist format (server)

NO :	PASCAL 02-0010779 INIST
ET :	Recognition of table of contents for electronic library consulting
AU :	BELAÏD (A.); DENGEL (Andreas); JUNKER (Markus)
AF :	LORIA-CNRS Campus Scientifique, B.P. 239/54506 Vandoeuvre-lès-Nancy/France (1 aut.)
DT :	Serial publication; Analytic level
SO :	International journal on document analysis and recognition : (Print); ISSN 1433-2833; Germany; Da. 2001; Vol. 4; No. 1; Pp. 35-45; Bibl. 18 ref.
LA :	English
EA :	A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.
CC :	001D02C03
FD :	Bibliothèque électronique; Segmentation image; Problème agencement; Forme canonique; Dictionnaire; Texte; Reconnaissance forme; Reconnaissance caractère; Reconnaissance automatique; Analyse contenu; Reconnaissance optique caractère
ED :	Electronic library; Image segmentation; Layout problem; Canonical form; Dictionaries; Text; Pattern recognition; Character recognition; Automatic recognition; Content analysis; Optical character recognition
SD :	Biblioteca electronica; Problema disposición; Forma canónica; Diccionario; Texto; Reconocimiento patrón; Reconocimiento carácter; Reconocimiento automático; Análisis contenido; Reconocimento óptico de caracteres
LO :	INIST-26790.354000099577480040
ID :	02-0010779
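
The server format above is a flat list of two-letter keys followed by a colon and a value, with multi-valued fields separated by semicolons. As a rough sketch under that assumption (the helper names `parse_server_record` and `split_values` are hypothetical and not part of Dilib), it can be loaded into a dictionary like this:

```python
import re


def parse_server_record(text):
    """Map each two-letter key of a server-format record to its raw value."""
    record = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z]{2})\s*:\s*(.*)$", line)
        if match:
            record[match.group(1)] = match.group(2).strip()
    return record


def split_values(record, key):
    """Return the semicolon-separated values of a multi-valued field as a list."""
    return [part.strip() for part in record.get(key, "").split(";") if part.strip()]


sample = (
    "NO :\tPASCAL 02-0010779 INIST\n"
    "AU :\tBELAÏD (A.); DENGEL (Andreas); JUNKER (Markus)\n"
    "ID :\t02-0010779"
)
record = parse_server_record(sample)
print(record["NO"])
print(split_values(record, "AU"))
```

Only the first colon on a line is treated as the key separator, so values that themselves contain " : ", such as the journal title in the SO field, are kept whole.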

Links to Exploration step

Pascal:02-0010779

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Recognition of table of contents for electronic library consulting</title>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-CNRS Campus Scientifique, B.P. 239</s1>
<s2>54506 Vandoeuvre-lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">02-0010779</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 02-0010779 INIST</idno>
<idno type="RBID">Pascal:02-0010779</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000696</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Recognition of table of contents for electronic library consulting</title>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-CNRS Campus Scientifique, B.P. 239</s1>
<s2>54506 Vandoeuvre-lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic recognition</term>
<term>Canonical form</term>
<term>Character recognition</term>
<term>Content analysis</term>
<term>Dictionaries</term>
<term>Electronic library</term>
<term>Image segmentation</term>
<term>Layout problem</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Bibliothèque électronique</term>
<term>Segmentation image</term>
<term>Problème agencement</term>
<term>Forme canonique</term>
<term>Dictionnaire</term>
<term>Texte</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance automatique</term>
<term>Analyse contenu</term>
<term>Reconnaissance optique caractère</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>1433-2833</s0>
</fA01>
<fA03 i2="1"><s0>Int. j. doc. anal. recognit. : (Print)</s0>
</fA03>
<fA05><s2>4</s2>
</fA05>
<fA06><s2>1</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG"><s1>Recognition of table of contents for electronic library consulting</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>Special Issue on Document Analysis for Office Systems (Part II)</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>BELAÏD (A.)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>DENGEL (Andreas)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>JUNKER (Markus)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>LORIA-CNRS Campus Scientifique, B.P. 239</s1>
<s2>54506 Vandoeuvre-lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA20><s1>35-45</s1>
</fA20>
<fA21><s1>2001</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA43 i1="01"><s1>INIST</s1>
<s2>26790</s2>
<s5>354000099577480040</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2002 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>18 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>02-0010779</s0>
</fA47>
<fA60><s1>P</s1>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA64 i1="01" i2="1"><s0>International journal on document analysis and recognition : (Print)</s0>
</fA64>
<fA66 i1="01"><s0>DEU</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001D02C03</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Bibliothèque électronique</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Electronic library</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Biblioteca electronica</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="1" l="FRE"><s0>Segmentation image</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="1" l="ENG"><s0>Image segmentation</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Problème agencement</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Layout problem</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Problema disposición</s0>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Forme canonique</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Canonical form</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Forma canónica</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Dictionnaire</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Dictionaries</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Diccionario</s0>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Texte</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Text</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Texto</s0>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Reconnaissance forme</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Pattern recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Reconocimiento patrón</s0>
<s5>07</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>08</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE"><s0>Reconnaissance automatique</s0>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG"><s0>Automatic recognition</s0>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA"><s0>Reconocimiento automático</s0>
<s5>09</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE"><s0>Analyse contenu</s0>
<s5>10</s5>
</fC03>
<fC03 i1="10" i2="X" l="ENG"><s0>Content analysis</s0>
<s5>10</s5>
</fC03>
<fC03 i1="10" i2="X" l="SPA"><s0>Análisis contenido</s0>
<s5>10</s5>
</fC03>
<fC03 i1="11" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>11</s5>
</fC03>
<fC03 i1="11" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>11</s5>
</fC03>
<fC03 i1="11" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0>
<s5>11</s5>
</fC03>
<fN21><s1>001</s1>
</fN21>
</pA>
</standard>
<server><NO>PASCAL 02-0010779 INIST</NO>
<ET>Recognition of table of contents for electronic library consulting</ET>
<AU>BELAÏD (A.); DENGEL (Andreas); JUNKER (Markus)</AU>
<AF>LORIA-CNRS Campus Scientifique, B.P. 239/54506 Vandoeuvre-lès-Nancy/France (1 aut.)</AF>
<DT>Publication en série; Niveau analytique</DT>
<SO>International journal on document analysis and recognition : (Print); ISSN 1433-2833; Allemagne; Da. 2001; Vol. 4; No. 1; Pp. 35-45; Bibl. 18 ref.</SO>
<LA>Anglais</LA>
<EA>A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. This method operates on a roughly structured ASCII file, produced by OCR. The recognition approach operates by text labelling without using any a priori model. Labelling is based on part-of-speech tagging (PoS) which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced in canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are integrated in one or another field by either applying PoS correction rules or using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.</EA>
<CC>001D02C03</CC>
<FD>Bibliothèque électronique; Segmentation image; Problème agencement; Forme canonique; Dictionnaire; Texte; Reconnaissance forme; Reconnaissance caractère; Reconnaissance automatique; Analyse contenu; Reconnaissance optique caractère</FD>
<ED>Electronic library; Image segmentation; Layout problem; Canonical form; Dictionaries; Text; Pattern recognition; Character recognition; Automatic recognition; Content analysis; Optical character recognition</ED>
<SD>Biblioteca electronica; Problema disposición; Forma canónica; Diccionario; Texto; Reconocimiento patrón; Reconocimiento carácter; Reconocimiento automático; Análisis contenido; Reconocimento óptico de caracteres</SD>
<LO>INIST-26790.354000099577480040</LO>
<ID>02-0010779</ID>
</server>
</inist>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000696 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000696 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:02-0010779
   |texte=   Recognition of table of contents for electronic library consulting
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
