OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Heuristics for identification of bibliographic elements from title pages

Internal identifier: 000482 (PascalFrancis/Checkpoint); previous: 000481; next: 000483

Authors: Durga Sankar Rath [India]; A. R. D. Prasad [India]

Source: Library hi tech (ISSN 0737-8831), 2004, vol. 22, no. 4, pp. 389-396

RBID : Pascal:05-0145792

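French descriptors

Bibliothèque; Catalogage; Donnée bibliographique; Identification; Titre; Reconnaissance automatique; Reconnaissance optique caractère; Numérisation; Système expert; Langage HTML; Heuristique

English descriptors

Automatic recognition; Bibliographic data; Cataloging; Digitizing; Expert system; HTML language; Heuristics; Identification; Library; Optical character recognition; Title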

Abstract

This paper presents a methodology for automatically identifying bibliographic data elements from the title pages of books. It also enumerates the steps involved: scanning the title pages, running optical character recognition (OCR) software, generating HTML (HyperText Markup Language) files from the title pages, and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of bibliographic descriptive elements such as title, author, publisher and others. The first survey examines the sequence of the bibliographic data on the title pages. The second examines the font size, font type and proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics for a rule-based expert system that identifies the bibliographic elements on the title pages. The results of the system are presented, along with the problems encountered.
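
The abstract does not state the actual rules, so the following is only a minimal sketch of what a rule-based pass over font size and sequence might look like. The `Line` structure, the three rules, and the sample page are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One text line recovered from an OCR'd title page (hypothetical structure)."""
    text: str
    font_size: float  # point size, assumed to be available from the HTML output
    order: int        # 0-based position counted from the top of the page


def label_title_page(lines):
    """Assign illustrative labels using font size and sequence only.

    Assumed rules (NOT the authors' heuristics):
      1. the line with the largest font is the title;
      2. the first line after the title is the author;
      3. the last line on the page is the publisher.
    """
    if not lines:
        return {}
    ordered = sorted(lines, key=lambda l: l.order)
    # Rule 1: largest font wins; ties go to the line nearest the top.
    title = max(ordered, key=lambda l: (l.font_size, -l.order))
    labels = {"title": title.text}
    after_title = [l for l in ordered if l.order > title.order]
    if after_title:
        # Rule 2: sequence heuristic, the author is assumed to follow the title.
        labels["author"] = after_title[0].text
    if len(after_title) > 1:
        # Rule 3: the publisher is assumed to close the page.
        labels["publisher"] = after_title[-1].text
    return labels


# Made-up sample page, for illustration only.
page = [
    Line("Heuristics in Practice", 28.0, 0),
    Line("A. N. Author", 14.0, 1),
    Line("Second edition", 10.0, 2),
    Line("Example Press", 12.0, 3),
]
result = label_title_page(page)
print(result)
```

A real system of the kind described would presumably also weigh font type and proximity, which this sketch leaves out.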


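Affiliations:

Durga Sankar Rath: Department of Library and Information Science, Ravindra Bharati University, Kolkata, India
A. R. D. Prasad: Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India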



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Heuristics for identification of bibliographic elements from title pages</title>
<author>
<name sortKey="Durga Sankar Rath" sort="Durga Sankar Rath" uniqKey="Durga Sankar Rath" last="Durga Sankar Rath">DURGA SANKAR RATH</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Library and Information Science, Ravindra Bharati University</s1>
<s2>Kolkata</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Kolkata</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Documentation Research and Training Centre, Indian Statistical Institute</s1>
<s2>Bangalore, Karnataka</s2>
<s3>IND</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Bangalore, Karnataka</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">05-0145792</idno>
<date when="2004">2004</date>
<idno type="stanalyst">PASCAL 05-0145792 INIST</idno>
<idno type="RBID">Pascal:05-0145792</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000484</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000305</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000482</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Heuristics for identification of bibliographic elements from title pages</title>
<author>
<name sortKey="Durga Sankar Rath" sort="Durga Sankar Rath" uniqKey="Durga Sankar Rath" last="Durga Sankar Rath">DURGA SANKAR RATH</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Department of Library and Information Science, Ravindra Bharati University</s1>
<s2>Kolkata</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Kolkata</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Documentation Research and Training Centre, Indian Statistical Institute</s1>
<s2>Bangalore, Karnataka</s2>
<s3>IND</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Inde</country>
<wicri:noRegion>Bangalore, Karnataka</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Library hi tech</title>
<title level="j" type="abbreviated">Libr. hi tech</title>
<idno type="ISSN">0737-8831</idno>
<imprint>
<date when="2004">2004</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Library hi tech</title>
<title level="j" type="abbreviated">Libr. hi tech</title>
<idno type="ISSN">0737-8831</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic recognition</term>
<term>Bibliographic data</term>
<term>Cataloging</term>
<term>Digitizing</term>
<term>Expert system</term>
<term>HTML language</term>
<term>Heuristics</term>
<term>Identification</term>
<term>Library</term>
<term>Optical character recognition</term>
<term>Title</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Bibliothèque</term>
<term>Catalogage</term>
<term>Donnée bibliographique</term>
<term>Identification</term>
<term>Titre</term>
<term>Reconnaissance automatique</term>
<term>Reconnaissance optique caractère</term>
<term>Numérisation</term>
<term>Système expert</term>
<term>Langage HTML</term>
<term>Heuristique</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Bibliothèque</term>
<term>Catalogage</term>
<term>Numérisation</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. Also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML (HyperText Markup Language) files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule-based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0737-8831</s0>
</fA01>
<fA02 i1="01">
<s0>LIHTD2</s0>
</fA02>
<fA03 i2="1">
<s0>Libr. hi tech</s0>
</fA03>
<fA05>
<s2>22</s2>
</fA05>
<fA06>
<s2>4</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG">
<s1>Heuristics for identification of bibliographic elements from title pages</s1>
</fA08>
<fA11 i1="01" i2="1">
<s1>DURGA SANKAR RATH</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>PRASAD (A. R. D.)</s1>
</fA11>
<fA14 i1="01">
<s1>Department of Library and Information Science, Ravindra Bharati University</s1>
<s2>Kolkata</s2>
<s3>IND</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02">
<s1>Documentation Research and Training Centre, Indian Statistical Institute</s1>
<s2>Bangalore, Karnataka</s2>
<s3>IND</s3>
<sZ>2 aut.</sZ>
</fA14>
<fA20>
<s1>389-396</s1>
</fA20>
<fA21>
<s1>2004</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>20448</s2>
<s5>354000121223890060</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2005 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>15 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>05-0145792</s0>
</fA47>
<fA60>
<s1>P</s1>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Library hi tech</s0>
</fA64>
<fA66 i1="01">
<s0>GBR</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. Also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML (HyperText Markup Language) files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule-based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001A01E02A</s0>
</fC02>
<fC02 i1="02" i2="X">
<s0>205</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Bibliothèque</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Library</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Biblioteca</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Catalogage</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Cataloging</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Catalogación</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Donnée bibliographique</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Bibliographic data</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Dato bibliográfico</s0>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Identification</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Identification</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Identificación</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Titre</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Title</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Título</s0>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE">
<s0>Reconnaissance automatique</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG">
<s0>Automatic recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA">
<s0>Reconocimiento automático</s0>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA">
<s0>Reconocimiento óptico de caracteres</s0>
<s5>07</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE">
<s0>Numérisation</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG">
<s0>Digitizing</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA">
<s0>Numerización</s0>
<s5>08</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE">
<s0>Système expert</s0>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG">
<s0>Expert system</s0>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA">
<s0>Sistema experto</s0>
<s5>09</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE">
<s0>Langage HTML</s0>
<s5>10</s5>
</fC03>
<fC03 i1="10" i2="X" l="ENG">
<s0>HTML language</s0>
<s5>10</s5>
</fC03>
<fC03 i1="10" i2="X" l="SPA">
<s0>Lenguaje HTML</s0>
<s5>10</s5>
</fC03>
<fC03 i1="11" i2="X" l="FRE">
<s0>Heuristique</s0>
<s2>NI</s2>
<s4>CD</s4>
<s5>96</s5>
</fC03>
<fC03 i1="11" i2="X" l="ENG">
<s0>Heuristics</s0>
<s2>NI</s2>
<s4>CD</s4>
<s5>96</s5>
</fC03>
<fN21>
<s1>094</s1>
</fN21>
<fN44 i1="01">
<s1>PSI</s1>
</fN44>
<fN82>
<s1>PSI</s1>
</fN82>
</pA>
</standard>
</inist>
<affiliations>
<list>
<country>
<li>Inde</li>
</country>
</list>
<tree>
<country name="Inde">
<noRegion>
<name sortKey="Durga Sankar Rath" sort="Durga Sankar Rath" uniqKey="Durga Sankar Rath" last="Durga Sankar Rath">DURGA SANKAR RATH</name>
</noRegion>
<name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
</country>
</tree>
</affiliations>
</record>
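
As a rough illustration of how the journal fields in this record could be read programmatically, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It works on an abridged copy of the `<series>` element only. The full record uses `inist:` and `wicri:` prefixes with no namespace declarations in this excerpt, so a strict parser would likely reject it as-is. The `read_series` helper and its dictionary keys are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Abridged fragment copied from the record; it contains no `inist:` or
# `wicri:` names, so it parses without any namespace declarations.
FRAGMENT = """
<biblStruct>
  <series>
    <title level="j" type="main">Library hi tech</title>
    <title level="j" type="abbreviated">Libr. hi tech</title>
    <idno type="ISSN">0737-8831</idno>
    <imprint>
      <date when="2004">2004</date>
    </imprint>
  </series>
</biblStruct>
"""


def read_series(xml_text):
    """Return the journal fields of a <biblStruct> fragment as a dict."""
    root = ET.fromstring(xml_text)
    series = root.find("series")
    return {
        # Attribute predicates select between the two <title> variants.
        "journal": series.findtext("title[@type='main']"),
        "abbreviation": series.findtext("title[@type='abbreviated']"),
        "issn": series.findtext("idno[@type='ISSN']"),
        # The year is read from the `when` attribute rather than the text.
        "year": series.find("imprint/date").get("when"),
    }


info = read_series(FRAGMENT)
print(info)
```

To process the complete record, one option would be to bind the `inist` and `wicri` prefixes to placeholder namespace URIs on the root element before parsing. The correct URIs are not given in this page.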

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000482 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Checkpoint/biblio.hfd -nk 000482 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Pascal:05-0145792
   |texte=   Heuristics for identification of bibliographic elements from title pages
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024