Heuristics for identification of bibliographic elements from title pages

Internal identifier: 000C65 (Istex/Corpus); previous: 000C64; next: 000C66

Authors: Durga Sankar Rath; A. R. D. Prasad

Source:

Library Hi Tech [ 0737-8831 ] ; 2004-12-01.

RBID : ISTEX:444D56D27EBF7681527E9F282D508A59D2646702

Abstract

This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule‐based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.

Url:

https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/pdf

DOI: 10.1108/07378830410570494

Links to Exploration step

ISTEX:444D56D27EBF7681527E9F282D508A59D2646702

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Heuristics for identification of bibliographic elements from title pages</title>
<author><name sortKey="Sankar Rath, Durga" sort="Sankar Rath, Durga" uniqKey="Sankar Rath D" first="Durga" last="Sankar Rath">Durga Sankar Rath</name>
<affiliation><mods:affiliation>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
<affiliation><mods:affiliation>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:444D56D27EBF7681527E9F282D508A59D2646702</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1108/07378830410570494</idno>
<idno type="url">https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000C65</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Heuristics for identification of bibliographic elements from title pages</title>
<author><name sortKey="Sankar Rath, Durga" sort="Sankar Rath, Durga" uniqKey="Sankar Rath D" first="Durga" last="Sankar Rath">Durga Sankar Rath</name>
<affiliation><mods:affiliation>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Prasad, A R D" sort="Prasad, A R D" uniqKey="Prasad A" first="A. R. D." last="Prasad">A. R. D. Prasad</name>
<affiliation><mods:affiliation>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Library Hi Tech</title>
<idno type="ISSN">0737-8831</idno>
<imprint><publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2004-12-01">2004-12-01</date>
<biblScope unit="volume">22</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="389">389</biblScope>
<biblScope unit="page" to="396">396</biblScope>
</imprint>
<idno type="ISSN">0737-8831</idno>
</series>
<idno type="istex">444D56D27EBF7681527E9F282D508A59D2646702</idno>
<idno type="DOI">10.1108/07378830410570494</idno>
<idno type="filenameID">2380220408</idno>
<idno type="original-pdf">2380220408.pdf</idno>
<idno type="href">07378830410570494.pdf</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0737-8831</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule‐based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</div>
</front>
</TEI>
<istex><corpusName>emerald</corpusName>
<author><json:item><name>Durga Sankar Rath</name>
<affiliations><json:string>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India</json:string>
</affiliations>
</json:item>
<json:item><name>A.R.D. Prasad</name>
<affiliations><json:string>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India</json:string>
</affiliations>
</json:item>
</author>
<subject><json:item><lang><json:string>eng</json:string>
</lang>
<value>Bibliographic systems</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Data handling</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Cataloguing</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Classification schemes</value>
</json:item>
<json:item><lang><json:string>eng</json:string>
</lang>
<value>Information operations</value>
</json:item>
</subject>
<language><json:string>eng</json:string>
</language>
<abstract>This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule‐based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</abstract>
<qualityIndicators><score>6.282</score>
<pdfVersion>1.2</pdfVersion>
<pdfPageSize>610 x 789 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<keywordCount>5</keywordCount>
<abstractCharCount>981</abstractCharCount>
<pdfWordCount>4494</pdfWordCount>
<pdfCharCount>27354</pdfCharCount>
<pdfPageCount>8</pdfPageCount>
<abstractWordCount>149</abstractWordCount>
</qualityIndicators>
<title>Heuristics for identification of bibliographic elements from title pages</title>
<genre.original><json:string>research-article</json:string>
</genre.original>
<genre><json:string>research-article</json:string>
</genre>
<host><volume>22</volume>
<publisherId><json:string>lht</json:string>
</publisherId>
<pages><last>396</last>
<first>389</first>
</pages>
<issn><json:string>0737-8831</json:string>
</issn>
<issue>4</issue>
<subject><json:item><value>Information & knowledge management</value>
</json:item>
<json:item><value>Information & communications technology</value>
</json:item>
<json:item><value>Internet</value>
</json:item>
<json:item><value>Library & information science</value>
</json:item>
<json:item><value>Information behaviour & retrieval</value>
</json:item>
<json:item><value>Librarianship/library management</value>
</json:item>
<json:item><value>Information user studies</value>
</json:item>
<json:item><value>Metadata</value>
</json:item>
<json:item><value>Library technology</value>
</json:item>
</subject>
<genre><json:string>Journal</json:string>
</genre>
<language><json:string>unknown</json:string>
</language>
<title>Library Hi Tech</title>
<doi><json:string>10.1108/lht</json:string>
</doi>
</host>
<publicationDate>2004</publicationDate>
<copyrightDate>2004</copyrightDate>
<doi><json:string>10.1108/07378830410570494</json:string>
</doi>
<id>444D56D27EBF7681527E9F282D508A59D2646702</id>
<fulltext><json:item><original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/pdf</uri>
</json:item>
<json:item><original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/tei"><teiHeader><fileDesc><titleStmt><title level="a" type="main" xml:lang="en">Heuristics for identification of bibliographic elements from title pages</title>
</titleStmt>
<publicationStmt><authority>ISTEX</authority>
<publisher>Emerald Group Publishing Limited</publisher>
<availability><p>EMERALD</p>
</availability>
<date>2004</date>
</publicationStmt>
<sourceDesc><biblStruct type="inbook"><analytic><title level="a" type="main" xml:lang="en">Heuristics for identification of bibliographic elements from title pages</title>
<author><persName><forename type="first">Durga</forename>
<surname>Sankar Rath</surname>
</persName>
<affiliation>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India</affiliation>
</author>
<author><persName><forename type="first">A.R.D.</forename>
<surname>Prasad</surname>
</persName>
<affiliation>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India</affiliation>
</author>
</analytic>
<monogr><title level="j">Library Hi Tech</title>
<idno type="pISSN">0737-8831</idno>
<idno type="DOI">10.1108/lht</idno>
<imprint><publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2004-12-01"></date>
<biblScope unit="volume">22</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="389">389</biblScope>
<biblScope unit="page" to="396">396</biblScope>
</imprint>
</monogr>
<idno type="istex">444D56D27EBF7681527E9F282D508A59D2646702</idno>
<idno type="DOI">10.1108/07378830410570494</idno>
<idno type="filenameID">2380220408</idno>
<idno type="original-pdf">2380220408.pdf</idno>
<idno type="href">07378830410570494.pdf</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><creation><date>2004</date>
</creation>
<langUsage><language ident="en">en</language>
</langUsage>
<abstract xml:lang="en"><p>This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule‐based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</p>
</abstract>
<textClass><keywords scheme="keyword"><list><head>Keywords</head>
<item><term>Bibliographic systems</term>
</item>
<item><term>Data handling</term>
</item>
<item><term>Cataloguing</term>
</item>
<item><term>Classification schemes</term>
</item>
<item><term>Information operations</term>
</item>
</list>
</keywords>
</textClass>
<textClass><keywords scheme="Emerald Subject Group"><list><label>cat-IKM</label>
<item><term>Information & knowledge management</term>
</item>
<label>cat-ICT</label>
<item><term>Information & communications technology</term>
</item>
<label>cat-INT</label>
<item><term>Internet</term>
</item>
</list>
</keywords>
</textClass>
<textClass><keywords scheme="Emerald Subject Group"><list><label>cat-LISC</label>
<item><term>Library & information science</term>
</item>
<label>cat-IBRT</label>
<item><term>Information behaviour & retrieval</term>
</item>
<label>cat-LLM</label>
<item><term>Librarianship/library management</term>
</item>
<label>cat-IUS</label>
<item><term>Information user studies</term>
</item>
<label>cat-MTD</label>
<item><term>Metadata</term>
</item>
<label>cat-LTC</label>
<item><term>Library technology</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc><change when="2004-12-01">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item><original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/444D56D27EBF7681527E9F282D508A59D2646702/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata><istex:metadataXml wicri:clean="corpus emerald not found" wicri:toSee="no header"><istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:document><!-- Auto generated NISO JATS XML created by Atypon out of MCB DTD source files. Do Not Edit! --><article dtd-version="1.0" xml:lang="en" article-type="research-article"><front><journal-meta><journal-id journal-id-type="publisher-id">lht</journal-id>
<journal-id journal-id-type="doi">10.1108/lht</journal-id>
<journal-title-group><journal-title>Library Hi Tech</journal-title>
</journal-title-group>
<issn pub-type="ppub">0737-8831</issn>
<publisher><publisher-name>Emerald Group Publishing Limited</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="doi">10.1108/07378830410570494</article-id>
<article-id pub-id-type="original-pdf">2380220408.pdf</article-id>
<article-id pub-id-type="filename">2380220408</article-id>
<article-categories><subj-group subj-group-type="type-of-publication"><compound-subject><compound-subject-part content-type="code">research-article</compound-subject-part>
<compound-subject-part content-type="label">Research paper</compound-subject-part>
</compound-subject>
</subj-group>
<subj-group subj-group-type="subject"><compound-subject><compound-subject-part content-type="code">cat-IKM</compound-subject-part>
<compound-subject-part content-type="label">Information & knowledge management</compound-subject-part>
</compound-subject>
<subj-group><compound-subject><compound-subject-part content-type="code">cat-ICT</compound-subject-part>
<compound-subject-part content-type="label">Information & communications technology</compound-subject-part>
</compound-subject>
<subj-group><compound-subject><compound-subject-part content-type="code">cat-INT</compound-subject-part>
<compound-subject-part content-type="label">Internet</compound-subject-part>
</compound-subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="subject"><compound-subject><compound-subject-part content-type="code">cat-LISC</compound-subject-part>
<compound-subject-part content-type="label">Library & information science</compound-subject-part>
</compound-subject>
<subj-group><compound-subject><compound-subject-part content-type="code">cat-IBRT</compound-subject-part>
<compound-subject-part content-type="label">Information behaviour & retrieval</compound-subject-part>
</compound-subject>
<subj-group><compound-subject><compound-subject-part content-type="code">cat-IUS</compound-subject-part>
<compound-subject-part content-type="label">Information user studies</compound-subject-part>
</compound-subject>
<compound-subject><compound-subject-part content-type="code">cat-MTD</compound-subject-part>
<compound-subject-part content-type="label">Metadata</compound-subject-part>
</compound-subject>
</subj-group>
</subj-group>
<subj-group><compound-subject><compound-subject-part content-type="code">cat-LLM</compound-subject-part>
<compound-subject-part content-type="label">Librarianship/library management</compound-subject-part>
</compound-subject>
<subj-group><compound-subject><compound-subject-part content-type="code">cat-LTC</compound-subject-part>
<compound-subject-part content-type="label">Library technology</compound-subject-part>
</compound-subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group><article-title>Heuristics for identification of bibliographic elements from title pages</article-title>
</title-group>
<contrib-group><contrib contrib-type="author"><string-name><given-names>Durga</given-names>
 <surname>Sankar Rath</surname>
</string-name>
<aff>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India</aff>
</contrib>
<x></x>
<contrib contrib-type="author"><string-name><given-names>A.R.D.</given-names>
 <surname>Prasad</surname>
</string-name>
<aff>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India</aff>
</contrib>
</contrib-group>
<pub-date pub-type="ppub"><day>01</day>
<month>12</month>
<year>2004</year>
</pub-date>
<volume>22</volume>
<issue>4</issue>
<fpage>389</fpage>
<lpage>396</lpage>
<permissions><copyright-statement>© Emerald Group Publishing Limited</copyright-statement>
<copyright-year>2004</copyright-year>
<license license-type="publisher"><license-p></license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="07378830410570494.pdf"></self-uri>
<abstract><p>This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the various steps like scanning the title pages, running optical character recognition (OCR) software, generating HTML files out of title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of various bibliographic descriptive elements like title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data in the title pages. The second survey deals with the font size, font type and the proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics, in order to develop a rule‐based expert system which can identify the bibliographic elements on the title pages. The results of the system are presented, along with problems encountered.</p>
</abstract>
<kwd-group><kwd>Bibliographic systems</kwd>
<x>, </x>
<kwd>Data handling</kwd>
<x>, </x>
<kwd>Cataloguing</kwd>
<x>, </x>
<kwd>Classification schemes</kwd>
<x>, </x>
<kwd>Information operations</kwd>
</kwd-group>
<custom-meta-group><custom-meta><meta-name>peer-reviewed</meta-name>
<meta-value>no</meta-value>
</custom-meta>
<custom-meta><meta-name>academic-content</meta-name>
<meta-value>yes</meta-value>
</custom-meta>
<custom-meta><meta-name>rightslink</meta-name>
<meta-value>included</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body><sec><title>Introduction</title>
<p>One of the most time‐consuming technical operations in libraries is cataloguing. The cataloguing process describes each item in a collection, organizes the description into a coherent structure of relationships, and provides a tool in the form of a catalogue to access any document in a library. Although the work involved in cataloguing is very time consuming and not easily automated, libraries have long tried to reduce the amount of time and effort involved (<xref ref-type="bibr" rid="b1">Akiyama, 1990</xref>
). The process of determining bibliographic data from title pages of the documents is complex, yet systematic. An investigation of the intellectual process involved may yield a few heuristics to design an expert system paradigm that can automatically identify the bibliographic data elements from the title pages.</p>
<p>The process of descriptive cataloguing begins with the identification of bibliographic data about an item. These data include the following (<xref ref-type="bibr" rid="b8">Hagler and Simmons, 1982</xref>
):<list list-type="bullet"><list-item><label>• </label>
<p>Title and Statement of Responsibility area (i.e. the name of the item and names designating its intellectual responsibility).</p>
</list-item>
<list-item><label>• </label>
<p>Edition area.</p>
</list-item>
<list-item><label>• </label>
<p>Publication and distribution area.</p>
</list-item>
<list-item><label>• </label>
<p>Series area.</p>
</list-item>
<list-item><label>• </label>
<p>Note area.</p>
</list-item>
<list-item><label>• </label>
<p>Standard Number area.</p>
</list-item>
</list>
Some bibliographic data can be easily found in the item itself; others may come from other sources. In order to ensure that all items are described in the same way using at least the same starting point for gathering data, the notion of “chief source of information” has been introduced by cataloguers. The “chief source of information” is defined as “the source of bibliographic data to be given first preference as the source from which a bibliographic description (or portion thereof) is prepared” (<xref ref-type="bibr" rid="b7">Gorman and Winkler, 1988</xref>
). For monographs it is prescribed that “the page that occurs very near the beginning of a book that contains the most complete bibliographic information about the book” (<xref ref-type="bibr" rid="b6">Gorman and Winkler, 1978</xref>
), called the title page, is to be the first preference as a source of information for descriptive cataloguing. “The title page serves a purpose of information ... and as a means of distinction and identification” (<xref ref-type="bibr" rid="b15">Wyner, 1980</xref>
).</p>
<p>The purpose of this study is to investigate ways in which artificial intelligence techniques can be applied to the cataloguing process. The basic problem is to analyze, in terms of the conceptual level and logical flows, the way a computer can be taught to recognize bibliographic elements from the title page of a document (<xref ref-type="bibr" rid="b11">Jeng, 1986</xref>
). The assignment of descriptors falls in the realm of subject indexing, where the decision making is based on subject analysis, which is an intellectual process. A few automated systems have been developed that take clues from human‐mediated expressive titles of documents (<xref ref-type="bibr" rid="b2">Aptagiri <italic>et al.</italic>
, 1995</xref>
). However, it must be noted that the present study is concerned only with the recognition of descriptive cataloguing elements from scanned title pages and does not deal with automatic assignment of subject descriptors to the document.</p>
<p>A considerable part of the cataloguer's expertise lies not so much in the ability to execute the rules as in the ability to recognize the bibliographic conditions which determine the choice of rules. Even greater expertise is required to understand fully the purpose of the rules and to examine existing codes in a critical manner and suggest improvements. One important result of the development of expert systems and artificial intelligence in general is that it has made us aware of the importance of knowledge in a new way. For example, medical knowledge, as opposed to medicine itself, had not really been studied in a systematic manner (<xref ref-type="bibr" rid="b9">Hjerppe and Olander, 1985</xref>
). The plethora of medical consultation systems from MYCIN onwards has helped to change that situation. To the extent that expert systems in cataloguing have value, cataloguing knowledge too is worthy of study.</p>
<p>A logical place to start is the title page. Even when confronted by a book in a language we cannot read, we can reasonably distinguish the title and the name(s) of the author(s) and publisher(s). If we could discover the heuristics we use to do this, then we might eventually be able to develop intelligent systems capable of cataloguing with the least human intervention. The interpretation of title pages is a rather complicated business, as librarianship students quickly realize when they are first taught cataloguing. Later the skills involved come to be taken for granted. Cognitive analysis using both students and experienced cataloguers might be one way of uncovering heuristics and identifying pitfalls. Another approach (<xref ref-type="bibr" rid="b3">Bertin, 1983</xref>
) is suggested by semiotics, the theory of signs. We can also study the title page from syntactic, semantic, and pragmatic points of view.</p>
<p>At the syntactic (structural) level, we consider just the layout, using clues provided by the positions of the various features, sequence of occurrence, the size of the spaces, print size, and changes in the type font. At the semantic level, words and phrases like “by”, “edited by”, “ISBN”, “Library of Congress Cataloguing‐in‐Publication Data”, could be used to identify the various bibliographic data elements from the various sections of the title page or the verso of the title page of the document. Finally, at the pragmatic level, character strings which might represent names of authors, places, or publishers could be checked against authority files for verification, and the layout of the title pages could be compared with the patterns typical of books from various publishers to aid in the identification of uncertain features (<xref ref-type="bibr" rid="b4">Burger, 1984</xref>
).</p>
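<p>The semantic level described above can be illustrated with a small lookup of cue phrases. The following is a minimal Python sketch and not the system described in this paper: the table CUES, its labels and the function first_cue are hypothetical names introduced here, and only the cue phrases themselves come from the text. Longer phrases are listed first so that “edited by” is not mistaken for a plain “by”.</p>

```python
import re

# Hypothetical cue table: the phrases come from the text above,
# the labels are illustrative names chosen for this sketch.
# Longer phrases come first so "edited by" wins over plain "by".
CUES = [
    ("edited by", "editor statement"),
    ("library of congress", "cataloguing-in-publication data"),
    ("isbn", "standard number"),
    ("by", "author statement"),
]

def first_cue(line):
    """Return the label of the first cue phrase found as whole words, or None."""
    text = line.lower()
    for phrase, label in CUES:
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return label
    return None
```

<p>A line with no cue phrase returns None, which leaves the decision to the syntactic (layout) clues.</p>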
<p>The present work is an attempt to study the physical features of the title page, more specifically the sequence of occurrence, font size, format, special characters, and so on.</p>
<sec><title>Sequence of bibliographic data elements</title>
<p>To identify the sequence of bibliographic data elements on the title pages of monographs, a survey was conducted by physically checking the documents. Four major bibliographic elements that usually appear on the title pages of monographs were identified: title, author, publisher and place of publication. This does not mean that only these four data elements appear on title pages. In some documents one or two of them are missing, so that only three or even two elements are present. In addition, data elements such as series or conference information may also appear. A sample of 485 document title pages was taken to study the sequence of the four bibliographic data elements. The results are shown in <xref ref-type="fig" rid="F_2380220408006">Table I</xref>
.</p>
<p>The study of the order of data elements on the title page shows that in more than 90 per cent of the documents in the sample, “title” is first in the sequence. The most common pattern found is “Title”, “Author”, “Publisher” and “Place of publication”.</p>
</sec>
</sec>
<sec><title>Study of the bibliographic data elements</title>
<p>Although the most probable sequence is TAPuPl (title, author, publisher, place of publication), these elements do not appear as four simple lines on a title page. There is usually much more information, spread over many lines, about these four descriptive elements, and perhaps a few more descriptive elements. To analyze the information that appears on a title page, another survey was undertaken. This survey is a minor modification of the survey conducted earlier at DRTC by Mr Madhwacharya Mundgod (<xref ref-type="bibr" rid="b12">Mundgod, 1993</xref>
).</p>
<sec><title>Data collection</title>
<p>Data were collected by means of a questionnaire designed to record, for each bibliographic data element, the line number, font size, preceding field and presence of characteristic terms (e.g. “Inc” or “Press” in the case of the publisher field). The sample includes reference books, text books and conference/seminar proceedings. Only English language books were taken into consideration. The method of data collection adopted was systematic sampling (<xref ref-type="bibr" rid="b5">Cox and Miller, 1982</xref>
).</p>
</sec>
</sec>
<sec><title>Data analysis</title>
<sec><title>Presence of bibliographic information</title>
<p>Out of the 500 samples, 499 have title, 458 have author, 483 have publisher and 441 have place fields. The frequencies of occurrence of the other fields are as follows: sub‐title (164), volume (4), edition (48), conference proceedings (34), year (161), series (39). The following sections present a summary of the analysis of the various bibliographic descriptive elements.</p>
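<p>The counts above are easier to compare as shares of the 500 samples. The short Python sketch below only redoes that arithmetic; the names SAMPLE_SIZE, COUNTS and share are introduced here for illustration and are not part of the original study.</p>

```python
# Counts reported in the text for the sample of 500 title pages.
SAMPLE_SIZE = 500
COUNTS = {
    "title": 499,
    "publisher": 483,
    "author": 458,
    "place": 441,
    "sub-title": 164,
    "year": 161,
    "edition": 48,
    "series": 39,
    "conference proceedings": 34,
    "volume": 4,
}

def share(count, total=SAMPLE_SIZE):
    """Percentage of sampled title pages on which a field is present."""
    return round(100.0 * count / total, 1)

SHARES = {field: share(count) for field, count in COUNTS.items()}

for field, pct in SHARES.items():
    print(f"{field}: {pct} per cent")
```

<p>On these figures the four major elements each occur on more than 88 per cent of the title pages, whereas a volume statement occurs on fewer than 1 per cent.</p>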
</sec>
<sec><title>Title information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>Titles are found in the upper or upper‐middle portion of the title page.</p>
</list-item>
<list-item><label>• </label>
<p>The title appears first on the title page (75.15 per cent). In a few cases the author or series occupies the first position.</p>
</list-item>
<list-item><label>• </label>
<p>The fonts used in the title field are the largest (94.99 per cent) compared with the size of fonts in other fields.</p>
</list-item>
<list-item><label>• </label>
<p>If the title and sub‐title occur on the same line, they are separated by “:” (colon) or “‐” (hyphen).</p>
</list-item>
<list-item><label>• </label>
<p>A title need not consist only of alphabetic characters. The title string may contain numerals and punctuation marks such as commas and hyphens.</p>
</list-item>
<list-item><label>• </label>
<p>Titles usually contain terms like “The”, “An”, “Introduction”, “Theory”, “in”, “to”.</p>
</list-item>
</list>
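<p>The font-size observation in the list above lends itself to a simple rule. The following Python sketch is one possible reading of it, not the authors' rule base: it assumes each line of the title page is already available as a (text, font size) pair in top-to-bottom order, and the function name guess_title and the sample data are invented for illustration.</p>

```python
def guess_title(lines):
    """Pick the first run of consecutive lines set in the largest font.

    lines: list of (text, font_size) tuples in top-to-bottom order
    (an assumed input format). Returns the joined text, or None if empty.
    """
    if not lines:
        return None
    largest = max(size for _, size in lines)
    title_parts = []
    for text, size in lines:
        if size == largest:
            title_parts.append(text)
        elif title_parts:
            # The run of largest-font lines has ended.
            break
    return " ".join(title_parts)

# Invented sample page, for illustration only.
sample_page = [
    ("Library Science Series", 10),
    ("Introduction to", 24),
    ("Cataloguing Theory", 24),
    ("A. N. Author", 14),
    ("Example Press", 12),
]
print(guess_title(sample_page))
```

<p>Since the largest font marks the title in 94.99 per cent of the surveyed pages, a rule of this kind would need a fallback, such as position on the page, for the remaining cases.</p>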
</sec>
<sec><title>Sub‐title information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>If a sub‐title occurs, it immediately follows the title.</p>
</list-item>
<list-item><label>• </label>
<p>The sub‐title occurs between the second and fifth lines (94.74 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>If the title and sub‐title are of the same font size, they are separated by “:”, a horizontal line, or a blank space (a vertical gap between the title and sub‐title whose height is greater than the height of the characters used in the title string).</p>
</list-item>
<list-item><label>• </label>
<p>The sub‐title may also have terms like “a”, “an”, “the”, “to”.</p>
</list-item>
<list-item><label>• </label>
<p>Like the title, the sub‐title may contain numerals and punctuation marks.</p>
</list-item>
</list>
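<p>When title and sub‐title share a line, the separator observations can be sketched as a split on a colon or hyphen. The Python fragment below is a hedged illustration: the function name split_title is hypothetical, and requiring spaces around the hyphen is an assumption added here so that hyphenated words are not split.</p>

```python
def split_title(line):
    """Split a line into (title, sub_title) on ':' or a spaced hyphen.

    Returns (title, None) when no separator is found. The spaced-hyphen
    requirement is an assumption of this sketch, not a survey finding.
    """
    for separator in (":", " - "):
        if separator in line:
            title, sub_title = line.split(separator, 1)
            return title.strip(), sub_title.strip()
    return line.strip(), None
```

<p>Separation by a horizontal rule or by vertical white space depends on layout information and is not covered by this text-only sketch.</p>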
</sec>
<sec><title>Edition information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>Edition statements are found mostly in reference books (e.g. encyclopaedias, dictionaries, manuals), because reference books require periodic updating.</p>
</list-item>
<list-item><label>• </label>
<p>The term “Edition” always appears in the edition string.</p>
</list-item>
<list-item><label>• </label>
<p>The edition string generally contains an edition number, written either in numerals or in words (e.g. Edition 2, Second Edition).</p>
</list-item>
<list-item><label>• </label>
<p>The edition field is preceded by the author (43.18 per cent), the title (34.10 per cent) or the sub‐title (18.18 per cent); that is, it occurs after the author, title or sub‐title field.</p>
</list-item>
</list>
</sec>
<sec><title>Volume information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>A volume field is often found in reference books.</p>
</list-item>
<list-item><label>• </label>
<p>The term “Volume” or “V” or “Vol.” is present in the volume string.</p>
</list-item>
<list-item><label>• </label>
<p>A volume string generally consists of a volume number. The volume number may follow or precede the term “Volume”.</p>
</list-item>
<list-item><label>• </label>
<p>The title and the author are the most probable preceding fields for the volume field.</p>
</list-item>
</list>
</sec>
<sec><title>Author/contributor information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>The author field usually occurs in the third or fourth line (51.96 per cent). Less frequently it occurs in the fifth line (13.97 per cent), the first line (9.83 per cent) or the sixth line (9.3 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>Terms like “edited by”, “by” and “editor” occur in the author field (32.75 per cent); of these, “Edited by” is the most frequent (54.67 per cent), followed by “By” (30.00 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>Usually the author name does not exceed three or four words.</p>
</list-item>
<list-item><label>• </label>
<p>It is most common that the author field has a single author (70.74 per cent). Less frequently it will have two authors (23.80 per cent) and still less frequently three authors (4.37 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>In the case of multiple authorship, the authors' names are placed on different lines (78.36 per cent) or separated by “and” or “&” (17.91 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The usual preceding fields for the author field are, title (57.14 per cent), sub‐title (29.54 per cent), conference proceedings (5.33 per cent) and edition (3.63 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The horizontal position of author field is centered (57.21 per cent), or left‐aligned (34.28 per cent), or right‐aligned (8.51 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>More than half (55.90 per cent) of the author fields have author affiliation. Usually it will be in italic fonts.</p>
</list-item>
</list>
</sec>
<sec><title>Conference proceedings information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>This field occurs less frequently (6.8 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The probable line numbers for conference proceedings are 4th line (35.29 per cent), 3rd line (17.65 per cent), 2nd line (11.76 per cent), 6th line (8.82 per cent), 8th line (8.82 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The field normally has any one or more of the following terms: Conference; Proceedings; Seminar; Symposium; Workshop.</p>
</list-item>
<list-item><label>• </label>
<p>When the conference proceedings field does not occur in the first line, it is preceded only by the title (87.5 per cent) or the sub‐title (12.5 per cent).</p>
</list-item>
</list>
</sec>
<sec><title>Publisher information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>The publisher field appears in the lower portion of the title page.</p>
</list-item>
<list-item><label>• </label>
<p>The publisher field commonly occurs between the fifth and ninth lines (64.8 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The publisher field contains a symbol, the publisher's logo or the publisher's trademark (45.96 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The symbol, logo or trademark appears in the line preceding the publisher's name (60.36 per cent); sometimes it precedes (34.68 per cent) or follows (4.96 per cent) the publisher's name.</p>
</list-item>
<list-item><label>• </label>
<p>Terms like “Inc”, “Press” and “Published by” occur frequently in the publisher field: “Press” occurs with 29.21 per cent, “Publishing” with 22.86 per cent, “Company” with 16.51 per cent, “Inc” with 16.19 per cent and “Publisher(s)” with 7.30 per cent.</p>
</list-item>
<list-item><label>• </label>
<p>The publisher field will be of third‐largest font size (51.67 per cent), or second‐largest font (31.88 per cent) and less frequently of fourth‐largest with 10.35 per cent.</p>
</list-item>
<list-item><label>• </label>
<p>The most probable preceding field for the publisher field is the author (61.88 per cent). Less frequently it is preceded by the year (10.00 per cent), the title (8.95 per cent), the place (6.25 per cent) or the sub‐title (4.37 per cent).</p>
</list-item>
</list>
</sec>
<sec><title>Place of publication information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>The place name also appears in the lower portion of the title page.</p>
</list-item>
<list-item><label>• </label>
<p>It occurs frequently between the 6th and 10th lines (63.23 per cent) and less frequently in the 12th line (7.26 per cent), the 5th line (6.12 per cent) or the 11th line (5.67 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>If the publisher's name and the place name occur in the same line, they are separated by a blank space (47.75 per cent), a dot “.” (18.92 per cent), a comma “,” (11.71 per cent) or the publisher's logo (9.91 per cent). Less frequently, “/” or “‐” is used to separate the publisher's name and the place name.</p>
</list-item>
<list-item><label>• </label>
<p>If there is more than one place name, the names are separated by a blank space (40.97 per cent), a dot “.” (25.69 per cent), a comma “,” (12.15 per cent), or “and” or “&” (11.11 per cent). Less frequently, a different line, “/” or “‐” is used to separate place names.</p>
</list-item>
<list-item><label>• </label>
<p>In most cases the place field follows the publisher field (90.93 per cent). In some cases it follows the author (5.90 per cent) or the year (1.36 per cent).</p>
</list-item>
</list>
</sec>
<sec><title>Year of publication information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>The occurrence of year field is not so frequent (32.2 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The year field also appears in the lower portion of the title page.</p>
</list-item>
<list-item><label>• </label>
<p>The year field always contains a four‐digit numeral (in a few cases the month is printed with the year, e.g. May 1988).</p>
</list-item>
<list-item><label>• </label>
<p>The year field occurs in the last line (31.68 per cent), follows the place name on the same line (31.06 per cent; that line may itself be the last line), or comes before the publisher's name (29.19 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The field most frequently preceding the year field is the place name (63.75 per cent), followed by the author field (24.38 per cent) and, less frequently, the publisher field (6.87 per cent).</p>
</list-item>
</list>
</sec>
<sec><title>Series information</title>
<list list-type="bullet"><list-item><label>• </label>
<p>The most common position of the series field is the first line (92.5 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>In a series string, the term “series” is found (45.00 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>The series string usually ends with numerals, which indicates the series number (82.5 per cent).</p>
</list-item>
<list-item><label>• </label>
<p>If a series occurs in something other than the first line, either it precedes the author field (2.5 per cent) or the publisher (2.5 per cent) or the conference proceedings (2.5 per cent).</p>
</list-item>
</list>
</sec>
<sec><title>Rule base derived through heuristics</title>
<sec><title>Heuristics for title</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF a line appears in the first block of the Title Page, AND IF it is in the largest font, AND IF it contains some English words, THEN that element may be the Title of the document.</p>
<p><italic>Two</italic>
</p>
<p>IF the same size font continues in the succeeding lines, THEN the other lines should be included in the Title.</p>
</disp-quote>
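<p>The two title rules above can be sketched in Java roughly as follows. This is an illustrative sketch, not the authors' program: the <italic>Line</italic> class, its fields and the method name are hypothetical, and the test for “some English words” is approximated by looking for a run of at least two letters, whereas a real system would presumably consult a lexicon.</p>
<preformat><![CDATA[
```java
import java.util.List;

class TitleHeuristic {
    /** Hypothetical line model: HTML font size, block index on the page, and text. */
    static final class Line {
        final int fontSize;
        final int block;
        final String text;

        Line(int fontSize, int block, String text) {
            this.fontSize = fontSize;
            this.block = block;
            this.text = text;
        }
    }

    /** Returns the title built from Rule One and Rule Two, or null if no line qualifies. */
    static String findTitle(List<Line> lines) {
        int largest = 0;
        for (Line l : lines) {
            largest = Math.max(largest, l.fontSize);
        }
        StringBuilder title = new StringBuilder();
        boolean started = false;
        for (Line l : lines) {
            if (!started) {
                // Rule One: first block, largest font, contains something word-like
                // (assumption: a run of two or more letters stands in for "English words").
                if (l.block == 0 && l.fontSize == largest
                        && l.text.matches(".*[A-Za-z]{2,}.*")) {
                    title.append(l.text.trim());
                    started = true;
                }
            } else if (l.fontSize == largest) {
                // Rule Two: succeeding lines in the same font size extend the title.
                title.append(' ').append(l.text.trim());
            } else {
                break; // the run of largest-font lines has ended
            }
        }
        return started ? title.toString() : null;
    }
}
```
]]></preformat>
<p>Because the loop stops at the first line in a smaller font, a later line that happens to share the largest font (such as a publisher name) is not appended to the title.</p>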
</sec>
<sec><title>Heuristics for other title information</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF a ‘:’ or ‘‐’ appears in the Title, THEN the words following the ‘:’ or ‘‐’ constitute the Subtitle.</p>
<p><italic>Two</italic>
</p>
<p>IF the Title continues to the next line, AND IF the following line is in a lesser font, THEN the following part may be the Subtitle.</p>
</disp-quote>
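<p>Rule One for other title information amounts to splitting the title string at a separator. The sketch below is a minimal illustration with hypothetical names. As a labelled deviation from the rule as stated, it treats a hyphen as a separator only when it is surrounded by spaces, so that compound words such as “Object-Oriented” are not split.</p>
<preformat><![CDATA[
```java
class SubtitleSplit {
    /**
     * Splits a title line into {title, subtitle}. The subtitle is null when
     * no separator is found. Assumption: a hyphen separates title and
     * subtitle only when written as " - " (with spaces on both sides).
     */
    static String[] split(String titleLine) {
        int idx = titleLine.indexOf(':');
        int sepLen = 1;
        if (idx < 0) {
            idx = titleLine.indexOf(" - ");
            sepLen = 3;
        }
        if (idx < 0) {
            return new String[] { titleLine.trim(), null };
        }
        return new String[] {
            titleLine.substring(0, idx).trim(),
            titleLine.substring(idx + sepLen).trim()
        };
    }
}
```
]]></preformat>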
</sec>
<sec><title>Heuristics for edition</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF a line appears in between the Title and Author Block (e.g. first vertical space or break), AND IF it appears in a separate line, AND IF it contains some specific words (i.e. edition, ed.), THEN it may be the edition statement of the document.</p>
</disp-quote>
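<p>The keyword test in the edition rule can be sketched with a regular expression. The class and method names are hypothetical, and the positional condition (between the title and author blocks) is reduced here to two line indexes that the caller is assumed to have determined already.</p>
<preformat><![CDATA[
```java
import java.util.regex.Pattern;

class EditionHeuristic {
    // "edition" or the abbreviation "ed." starting at a word boundary, case-insensitive.
    private static final Pattern EDITION =
            Pattern.compile("\\b(?:edition|ed\\.)", Pattern.CASE_INSENSITIVE);

    /** True when the line contains one of the edition terms. */
    static boolean hasEditionTerm(String line) {
        return EDITION.matcher(line).find();
    }

    /**
     * True when the line lies strictly between the last title line and the
     * first author line (hypothetical indexes) and contains an edition term.
     */
    static boolean isEditionLine(int lineIndex, int titleEnd, int authorStart, String line) {
        return lineIndex > titleEnd && lineIndex < authorStart && hasEditionTerm(line);
    }
}
```
]]></preformat>
<p>The word boundary keeps strings such as “Edited by” or “Expedition” from being mistaken for an edition statement.</p>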
</sec>
<sec><title>Heuristics for volume</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF an element appears in between the Title and Author Block (e.g. first vertical space or break), AND IF it appears in a separate line, AND IF it contains some specific words (e.g. Part, Volume, Vol., V.), THEN it may be the volume statement of the document.</p>
</disp-quote>
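<p>Since the survey notes that the volume number may follow or precede the volume term, a sketch of the volume rule needs two patterns. The names below are hypothetical, only Arabic numerals are handled (roman numerals such as “Part II” are an open gap in this sketch), and the positional conditions of the rule are left to the caller.</p>
<preformat><![CDATA[
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VolumeHeuristic {
    // A volume term that is not immediately followed by another letter,
    // so that "Particle" or "Volumes" do not count as "Part" or "Volume".
    private static final String TERM = "(?:volume|vol\\.|v\\.|part)(?![A-Za-z])";

    private static final Pattern TERM_FIRST =
            Pattern.compile("\\b" + TERM + "\\s*(\\d+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern NUMBER_FIRST =
            Pattern.compile("(\\d+)\\s*" + TERM, Pattern.CASE_INSENSITIVE);

    /** Returns the volume number found in the line, or -1 if there is none. */
    static int volumeNumber(String line) {
        Matcher m = TERM_FIRST.matcher(line);
        if (m.find()) {
            return Integer.parseInt(m.group(1));
        }
        m = NUMBER_FIRST.matcher(line);
        if (m.find()) {
            return Integer.parseInt(m.group(1));
        }
        return -1;
    }
}
```
]]></preformat>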
</sec>
<sec><title>Heuristics for author/contributor</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF anything (string/s) appears in between the Title and Publisher blocks, AND IF it happens to be in the largest font between the Title and Publisher blocks, AND IF those are not English words (English words may belong to the authors' affiliation), THEN that element may be the Author of the document.</p>
<p><italic>Two</italic>
</p>
<p>IF anything (line(s)) appears before the Title, AND IF it does not contain any English words (English words may constitute a series statement), THEN it may be the Author of the document.</p>
<p><italic>Three</italic>
</p>
<p>IF the Author element is either followed or preceded by some specific terms (‘editor’, ‘edited’), THEN it may be the Editor of the concerned document.</p>
</disp-quote>
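<p>The “not English words” test and Rule Three can be sketched as below. The names are hypothetical, and the seven‐word set is only a stand‐in for the English lexicon that the rules presuppose; the source does not say which word list the authors' program uses.</p>
<preformat><![CDATA[
```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class AuthorHeuristic {
    // Stand-in lexicon (assumption): words typical of affiliations, not of personal names.
    static final Set<String> ENGLISH = new HashSet<>(Arrays.asList(
            "university", "department", "institute", "college", "professor", "of", "the"));

    /** True when the line is non-empty and none of its words is in the lexicon. */
    static boolean looksLikeName(String line) {
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && ENGLISH.contains(token)) {
                return false; // an English word suggests affiliation or series text
            }
        }
        return !line.trim().isEmpty();
    }

    /** Rule Three: text next to the name decides between Editor and Author. */
    static String role(String adjacentText) {
        String t = adjacentText.toLowerCase(Locale.ROOT);
        return (t.contains("editor") || t.contains("edited")) ? "Editor" : "Author";
    }
}
```
]]></preformat>
<p>With this stand‐in lexicon, “Joseph Bergin” passes the name test while “Pace University” is rejected as probable affiliation text.</p>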
</sec>
<sec><title>Heuristics for conference proceedings</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF a line or continuous lines contain(s) some specific words (e.g. Proceedings, Seminar, Conference, Symposium, and some numeric figures), THEN it may be the information about Conference Proceedings.</p>
</disp-quote>
</sec>
<sec><title>Heuristics for publisher</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF some elements are found after the largest gap (vertical space) (it is presumed that the gap between author and publisher is more than the gap between title and author), AND IF any line of that Block (i.e. the last block) is in a larger font, THEN that may be the Publisher of the document.</p>
<p><italic>Two</italic>
</p>
<p>IF anything matches the limited set of the Publishers' lexicon, THEN that may be the Publisher of the document.</p>
</disp-quote>
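<p>Rule Two for the publisher is a lexicon lookup. The sketch below builds its lexicon from the terms reported in the publisher survey (“Press”, “Publishing”, “Company”, “Inc”, “Publisher(s)”, “Published by”); whether the authors' lexicon contains exactly these terms is an assumption, and the class and method names are hypothetical.</p>
<preformat><![CDATA[
```java
import java.util.List;
import java.util.regex.Pattern;

class PublisherHeuristic {
    // Whole-word, case-insensitive match against terms taken from the survey.
    private static final Pattern LEXICON = Pattern.compile(
            "\\b(?:press|publishing|company|inc|publishers?|published by)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first line that contains a publisher term, or null if none does. */
    static String findPublisher(List<String> lines) {
        for (String line : lines) {
            if (LEXICON.matcher(line).find()) {
                return line.trim();
            }
        }
        return null;
    }
}
```
]]></preformat>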
</sec>
<sec><title>Heuristics for place of publication</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF an element happens to be the last line of the last Block (after the largest gap), THEN that may be the Place(s) of Publication.</p>
<p><italic>Two</italic>
</p>
<p>IF anything matches the list of Places in the lexicon, THEN those may be the Places of Publication.</p>
</disp-quote>
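<p>The two place rules can be combined in a short sketch. The names and the sample lexicon are hypothetical, and the order in which the rules are tried (lexicon first, last line as a fallback) is an assumption, since the rule base does not state a priority.</p>
<preformat><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PlaceHeuristic {
    /** Rule Two: every lexicon entry that occurs in the line, in lexicon order. */
    static List<String> matchPlaces(String line, List<String> lexicon) {
        String haystack = line.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String place : lexicon) {
            // Plain substring test: "Delhi" also matches inside "New Delhi".
            if (haystack.contains(place.toLowerCase(Locale.ROOT))) {
                found.add(place);
            }
        }
        return found;
    }

    /** Tries the lexicon on the last block; falls back to Rule One (its last line). */
    static List<String> places(List<String> lastBlock, List<String> lexicon) {
        List<String> found = new ArrayList<>();
        for (String line : lastBlock) {
            found.addAll(matchPlaces(line, lexicon));
        }
        if (found.isEmpty() && !lastBlock.isEmpty()) {
            found.add(lastBlock.get(lastBlock.size() - 1).trim());
        }
        return found;
    }
}
```
]]></preformat>
<p>A lexicon lookup of this kind returns only the places it knows about, which would explain why a long imprint line can yield a shorter list of places than it actually names.</p>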
</sec>
<sec><title>Heuristics for year of publication</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF anything beginning with ‘19..’ or ‘20..’ appears in the last block, AND IF it consists of four numeric figures, THEN it may be the Year of Publication.</p>
</disp-quote>
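<p>The year rule maps directly onto a regular expression: a four‐digit number beginning with 19 or 20, searched for in the lines of the last block. The class and method names are hypothetical, and the caller is assumed to pass only the last block.</p>
<preformat><![CDATA[
```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearHeuristic {
    // Exactly four digits starting with 19 or 20, not embedded in a longer number.
    private static final Pattern YEAR = Pattern.compile("\\b(?:19|20)\\d{2}\\b");

    /** Returns the first year found in the given lines, or null if there is none. */
    static String findYear(List<String> lastBlock) {
        for (String line : lastBlock) {
            Matcher m = YEAR.matcher(line);
            if (m.find()) {
                return m.group();
            }
        }
        return null;
    }
}
```
]]></preformat>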
</sec>
<sec><title>Heuristics for series</title>
<disp-quote><p><italic>One</italic>
</p>
<p>IF anything appears before the Title, AND IF it contains some of the terms (e.g. series, endowment, some numbers), THEN it may be the Series of Publication of the document.</p>
<p><italic>Two</italic>
</p>
<p>IF anything appears after the Title, AND IF it contains some of the terms (e.g. series, endowment, some numbers), THEN it may be the Series of Publication of the document.</p>
</disp-quote>
</sec>
</sec>
</sec>
<sec><title>Program for identification of bibliographic data elements from title page</title>
<p>After the title page is scanned using OmniPage Pro 10 (<xref ref-type="bibr" rid="b13">OmniPage, 2001</xref>
), it is saved in HTML format. OmniPage can also generate a plain text (ASCII) file; however, such a file does not record the font size, font face and other physical characteristics required for the present study, so the HTML format is used. The intention is to use this HTML page, which gives clues about physical features such as font size and font face, to identify the bibliographic data elements of the title page by program. A model of this kind could also be developed for the creation of an online database of tables of contents of books (<xref ref-type="bibr" rid="b10">Jett <italic>et al.</italic>
, 1998</xref>
). The rules derived from the heuristics are implemented in a Java program. Artificial intelligence systems and expert systems are normally implemented in either PROLOG (Programming in Logic) or LISP, though C and Java are not uncommon. However, the problem with PROLOG is that it lacks good I/O operations; hence, Java was chosen for the present work. Since the input is in HTML format, a tokenizer is required to identify the HTML tags. The HTML tokenizer is described first; it is an adaptation of the tokenizer given by <xref ref-type="bibr" rid="b14">Vanhelsuwe <italic>et al.</italic>
 (1996</xref>
).</p>
<sec><title>HTML Tokenizer</title>
<p>Tokenizing some input means reducing it to a simpler stream of tokens. These tokens represent recurring chunks of data in the stream. Any Java compiler, for example, would check for grammatical correctness of the programs by checking the sequence of tokens representing reserved word strings like class, import, public, void, and so on. By not having to actually deal with the exact character sequences themselves, tokenizing as a technique has the following two main advantages:<list list-type="order"><list-item><label>1. </label>
<p>It reduces code complexity.</p>
</list-item>
<list-item><label>2. </label>
<p>It allows for flexible, quick changes in input syntax.</p>
</list-item>
</list>
Class “StreamTokenizer” can be used to turn any input stream into a stream of tokens. The programming model for the class is that a stream can contain three types of entities:<list list-type="order"><list-item><label>1. </label>
<p>Words (that is, multicharacter tokens).</p>
</list-item>
<list-item><label>2. </label>
<p>Single character tokens.</p>
</list-item>
<list-item><label>3. </label>
<p>Whitespace (including C/C++/Java‐style comments).</p>
</list-item>
</list>
Before we start processing a stream into tokens, we have to define which ASCII characters should be treated as which of the three possible input types; this step is called “defining the syntax table for the stream”.</p>
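<p>As an illustration of defining a syntax table, the sketch below configures a <italic>java.io.StreamTokenizer</italic> so that letters and digits form words, blanks and control characters are whitespace, and everything else (including the angle brackets of HTML tags) is a single‐character token. It is not the authors' tokenizer; the class name, the method name and the sample input are assumptions.</p>
<preformat><![CDATA[
```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SyntaxTableDemo {
    /** Splits an HTML fragment into word tokens and single-character tokens. */
    static List<String> tokenize(String html) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(html));
        st.resetSyntax();            // start from an empty table: every character is "ordinary"
        st.wordChars('a', 'z');      // letters and digits build multicharacter words
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');
        st.whitespaceChars(0, ' ');  // control characters and blanks separate tokens
        // '<', '>', '=', '/' and the rest stay ordinary, i.e. single-character tokens.
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);
                } else {
                    tokens.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<FONT SIZE=7>DATA</FONT>"));
    }
}
```
]]></preformat>
<p>Calling <italic>resetSyntax()</italic> first matters here: the default table treats digits as numbers and “/” as a comment character, neither of which suits HTML input.</p>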
</sec>
</sec>
<sec><title>Sample pages</title>
<p>The previous sections discussed the physical study of the title pages, the heuristics developed from it, the corresponding rule base and the program that implements it. All of these are aimed at the automatic identification of bibliographic data elements from the title pages of documents.</p>
<sec><title>Sample of the actual page</title>
<p>For automatic identification, the title pages of the documents are scanned using an HP ScanJet 6100C flatbed scanner and the OmniPage Pro 10 software. See <xref ref-type="fig" rid="F_2380220408001">Figure 1</xref>
 and <xref ref-type="fig" rid="F_2380220408002">Figure 2</xref>
 for an example of an original title page and the HTML result.</p>
</sec>
<sec><title>Sample of the program output</title>
<p>The HTML Tokenizer (written in Java) is used for parsing the input files. If we use the HTML file in <xref ref-type="fig" rid="F_2380220408002">Figure 2</xref>
 as the input to the system, we get the following output:</p>
<disp-quote><p>7 H1 Arial Data: DATA</p>
<p>7 H1 Arial Data: ABSTRACTION</p>
<p>0 null null Data: BR</p>
<p>4 null Arial Data: THE OBJECT‐ORIENTED</p>
<p>4 null Arial Data: APPROACH USING C++</p>
<p>0 null null Data: BR</p>
<p>0 null null Data: BR</p>
<p>4 null Arial Data: Joseph Bergin</p>
<p>0 null null Data: BR</p>
<p>4 null Times Roman Data: Pace University</p>
<p>0 null null Data: BR</p>
<p>0 null null Data: BR</p>
<p>0 null null Data: BR</p>
<p>0 null null Data: BR</p>
<p>7 H1 Arial Data: McGraw‐Hill, Inc.</p>
<p>0 null null Data: BR</p>
<p>2 null Arial Data: New York St. Louis San Francisco Auckland Bogota Caracas Lisbon London Madrid Mexico City Milan Montreal New Delhi San Juan Singapore Sydney Tokyo Toronto</p>
</disp-quote>
</sec>
<sec><title>Output for the sample title page</title>
<disp-quote><p>TI: Data Abstraction</p>
<p>PU: Mcgraw‐Hill, Inc. (Used largest Font)</p>
<p>PL: Tokyo</p>
<p>PL: New York</p>
<p>PL: London</p>
<p>PL: Sydney</p>
<p>PL: Toronto</p>
<p>PL: Singapore</p>
<p>PL: Delhi</p>
<p>AU: Joseph Bergin (Author after title)</p>
<p>OT: The Object‐Oriented Approach Using C++</p>
</disp-quote>
<p>Note: the first part of the output is meant to check whether the HTML tokenizer correctly identified the data, font type and font face. The second part is the actual output of the program, i.e. the result after identification of the bibliographic data elements. The information in parentheses indicates which heuristic rule was used to identify the element.</p>
</sec>
</sec>
<sec><title>Conclusion</title>
<p>After developing the program, we randomly collected 50 documents for final cross‐checking and used their scanned title pages (HTML files) as input to the program. The outcome is fairly promising: in 46 of the 50 cases (92 per cent) the program produced a clear result. Some of the problems observed in the study are enumerated below.</p>
<sec><title>Problems relating to OCR</title>
<list list-type="bullet"><list-item><label>• </label>
<p>OmniPage Pro 10 claims more than 99 per cent accuracy, and its accuracy was found to be impressive for original documents.</p>
</list-item>
<list-item><label>• </label>
<p>We have taken photocopies of the Title pages because of the logistic problems in borrowing books from the library. From the photocopies we have generated the HTML pages using OmniPage. In some cases, the recognition is not completely satisfactory, as photocopies lose quality.</p>
</list-item>
<list-item><label>• </label>
<p>In the case of light colored printing, the OCR has some difficulties. Mostly these problems arise when the letters are not printed in black. Recognition is best when the page is in black and white.</p>
</list-item>
<list-item><label>• </label>
<p>In this study the library collection was used. The library stamp and barcode sticker on the title page can confuse the OCR and sometimes distort the final HTML output.</p>
</list-item>
<list-item><label>• </label>
<p>The HTML file of the scanned page misses some of the vertical spaces (breaks), which creates problems in applying the heuristics. This is a serious problem, especially when many lines between the author and publisher blocks are removed. In some cases, although the gap between the Author block and the Publisher block is bigger than the other gaps on the original page, the scanned title page in HTML format shows that gap as smaller than the other gaps.</p>
</list-item>
<list-item><label>• </label>
<p>Sometimes, it happens that the data elements of different lines appear in the same line in the scanned HTML page. In a few cases, the reverse also happens, i.e. a scanned line is split into two lines in the HTML document.</p>
</list-item>
<list-item><label>• </label>
<p>When an emblem or logo is adjacent to the publisher name, a part of the Publisher's name goes above the logo and another part of the publisher's name goes under the logo in the scanned HTML page. For example, in case of “Cambridge University Press”, the line having the word “Cambridge” appears above the logo, whereas the line having “University Press” appears under the logo, as if it is a place name.</p>
</list-item>
<list-item><label>• </label>
<p>In brief, although the character recognition of the OmniPage Pro is impressive, conversion to the HTML document is not always reliable. Since the present study focuses on the physical layout of the title pages, it becomes absolutely necessary that the OCR software should produce an accurate representation of the title page.</p>
</list-item>
</list>
</sec>
<sec><title>Problems relating to the program</title>
<list list-type="bullet"><list-item><label>• </label>
<p>If a person's name appears before the title, but it is neither the author, nor a series title, the system finds it difficult to identify some of the descriptive elements. In the same title page, if a list of authors’ names is given after title, this program goes astray, since personal name(s) can appear before the title and after the title. <xref ref-type="fig" rid="F_2380220408003">Figure 3</xref>
 shows an example.</p>
</list-item>
<list-item><label>• </label>
<p>Another serious problem arises when there is a series title, and then the first line of the title is in a smaller font size, and next line of the Title is in the largest font. So the system fails to recognize the first line as the part of the Title (see <xref ref-type="fig" rid="F_2380220408004">Figure 4</xref>
 and <xref ref-type="fig" rid="F_2380220408005">Figure 5</xref>
).</p>
</list-item>
</list>
<p>In the second example (Figure 5), the title is scattered across many lines, and different font sizes are used for its various segments. In such cases the system fails to identify the title, because the basic heuristic is that the “title appears in the largest font in consecutive lines”. The system takes the first line in the largest font as the start of the title and the last consecutive line in the largest font as its end. The rest of the title, set in a smaller font, is treated as other title information, such as a subtitle.</p>
</sec>
</sec>
<sec><fig position="float" id="F_2380220408001"><label><bold>Figure 1<x> </x>
</bold>
</label>
<caption><p>Sample of a scanned title page</p>
</caption>
<graphic xlink:href="2380220408001.tif"></graphic>
</fig>
</sec>
<sec><fig position="float" id="F_2380220408002"><label><bold>Figure 2<x> </x>
</bold>
</label>
<caption><p>Scanned title page in HTML format</p>
</caption>
<graphic xlink:href="2380220408002.tif"></graphic>
</fig>
</sec>
<sec><fig position="float" id="F_2380220408003"><label><bold>Figure 3<x> </x>
</bold>
</label>
<caption><p>Example of name appearing before title</p>
</caption>
<graphic xlink:href="2380220408003.tif"></graphic>
</fig>
</sec>
<sec><fig position="float" id="F_2380220408004"><label><bold>Figure 4<x> </x>
</bold>
</label>
<caption><p>Title problem (example 1)</p>
</caption>
<graphic xlink:href="2380220408004.tif"></graphic>
</fig>
</sec>
<sec><fig position="float" id="F_2380220408005"><label><bold>Figure 5<x> </x>
</bold>
</label>
<caption><p>Title problem (example 2)</p>
</caption>
<graphic xlink:href="2380220408005.tif"></graphic>
</fig>
</sec>
<sec><fig position="float" id="F_2380220408006"><label><bold>Table I<x> </x>
</bold>
</label>
<caption><p>Probability of each sequence</p>
</caption>
<graphic xlink:href="2380220408006.tif"></graphic>
</fig>
</sec>
</body>
<back><ref-list><title>References</title>
<ref id="b1"><mixed-citation><person-group person-group-type="author"><string-name><surname>Akiyama</surname>
, <given-names>T.N.</given-names>
</string-name>
</person-group>
 (<year>1990</year>
), “<article-title><italic>Automated entry system for printed documents</italic>
</article-title>
”, <source><italic>Pattern recognition</italic>
</source>
, Vol. <volume>23</volume>
 No. <issue>11</issue>
, pp. <fpage>1141</fpage>
<x>‐</x>
<lpage>54</lpage>
.</mixed-citation>
</ref>
<ref id="b2"><mixed-citation><person-group person-group-type="author"><string-name><surname>Aptagiri</surname>
, <given-names>D.V.</given-names>
</string-name>
</person-group>
, <person-group person-group-type="author"><string-name><surname>Gopinath</surname>
, <given-names>M.A.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Prasad</surname>
, <given-names>A.R.D.</given-names>
</string-name>
</person-group>
 (<year>1995</year>
), “<article-title><italic>A frame based knowledge representation paradigm for automating POPSI</italic>
</article-title>
”, <source><italic>Knowledge Organisation</italic>
</source>
, Vol. <volume>22</volume>
 No. <issue>3/4</issue>
, pp. <fpage>162</fpage>
<x>‐</x>
<lpage>7</lpage>
.</mixed-citation>
</ref>
<ref id="b3"><mixed-citation><person-group person-group-type="author"><string-name><surname>Bertin</surname>
, <given-names>J.</given-names>
</string-name>
</person-group>
 (<year>1983</year>
), <source><italic>Semiology of Graphics: Diagrams, Networks, Maps</italic>
</source>
, <publisher-name>University of Wisconsin Press</publisher-name>
, <publisher-loc>Madison, WI</publisher-loc>
.</mixed-citation>
</ref>
<ref id="b4"><mixed-citation><person-group person-group-type="author"><string-name><surname>Burger</surname>
, <given-names>R.H.</given-names>
</string-name>
</person-group>
 (<year>1984</year>
), “<article-title><italic>Artificial intelligence and authority control</italic>
</article-title>
”, <source><italic>Library Resources and Technical Services</italic>
</source>
, Vol. <volume>28</volume>
, pp. <fpage>337</fpage>
<x>‐</x>
<lpage>45</lpage>
.</mixed-citation>
</ref>
<ref id="b5"><mixed-citation><person-group person-group-type="author"><string-name><surname>Cox</surname>
, <given-names>D.R.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Miller</surname>
, <given-names>H.D.</given-names>
</string-name>
</person-group>
 (<year>1982</year>
), <source><italic>The Theory of Stochastic Processes</italic>
</source>
, <publisher-name>John Wiley & Sons</publisher-name>
, <publisher-loc>New York, NY</publisher-loc>
, pp. <fpage>203</fpage>
<x>‐</x>
<lpage>51</lpage>
.</mixed-citation>
</ref>
<ref id="b6"><mixed-citation><person-group person-group-type="author"><string-name><surname>Gorman</surname>
, <given-names>M.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Winkler</surname>
, <given-names>P.W.</given-names>
  <given-names>(Eds)</given-names>
</string-name>
</person-group>
 (<year>1978</year>
), <source><italic>Anglo‐American Cataloguing Rules</italic>
</source>
, <edition>2nd ed.</edition>
, <publisher-name>ALA</publisher-name>
, <publisher-loc>Chicago, IL</publisher-loc>
.</mixed-citation>
</ref>
<ref id="b7"><mixed-citation><person-group person-group-type="author"><string-name><surname>Gorman</surname>
, <given-names>M.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Winkler</surname>
, <given-names>P.W.</given-names>
  <given-names>(Eds)</given-names>
</string-name>
</person-group>
 (<year>1988</year>
), <source><italic>Anglo‐American Cataloguing Rules</italic>
</source>
, <edition>2nd rev. ed.</edition>
, <publisher-name>ALA</publisher-name>
, <publisher-loc>Chicago, IL</publisher-loc>
, p. <fpage>616</fpage>
.</mixed-citation>
</ref>
<ref id="b8"><mixed-citation><person-group person-group-type="author"><string-name><surname>Hagler</surname>
, <given-names>R.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Simmons</surname>
, <given-names>P.</given-names>
</string-name>
</person-group>
 (<year>1982</year>
), <source><italic>The Bibliographic Record and Information Technology</italic>
</source>
, <publisher-name>ALA</publisher-name>
, <publisher-loc>Chicago, IL</publisher-loc>
, p. <fpage>118</fpage>
.</mixed-citation>
</ref>
<ref id="b9"><mixed-citation><person-group person-group-type="author"><string-name><surname>Hjerppe</surname>
, <given-names>R.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Olander</surname>
, <given-names>B.</given-names>
</string-name>
</person-group>
 (<year>1985</year>
), <source><italic>Artificial Intelligence and Cataloguing</italic>
</source>
, <publisher-name>Linkoping University</publisher-name>
, <publisher-loc>Linkoping</publisher-loc>
.</mixed-citation>
</ref>
<ref id="b11"><mixed-citation><person-group person-group-type="author"><string-name><surname>Jeng</surname>
, <given-names>L.‐H.</given-names>
</string-name>
</person-group>
 (<year>1986</year>
), “<article-title><italic>An expert system for determining title proper in descriptive cataloguing: a conceptual model</italic>
</article-title>
”, <source><italic>Cataloguing and Classification Quarterly</italic>
</source>
, Vol. <volume>7</volume>
 No. <issue>2</issue>
, pp. <fpage>55</fpage>
<x>‐</x>
<lpage>69</lpage>
.</mixed-citation>
</ref>
<ref id="b10"><mixed-citation><person-group person-group-type="author"><string-name><surname>Jett</surname>
, <given-names>M.</given-names>
</string-name>
</person-group>
, <person-group person-group-type="author"><string-name><surname>Reuse</surname>
, <given-names>B.</given-names>
</string-name>
</person-group>
 and <person-group person-group-type="author"><string-name><surname>Kessling</surname>
, <given-names>G.</given-names>
</string-name>
</person-group>
 (<year>1998</year>
), “<article-title><italic>Implementation of an online database for tables of contents of books</italic>
</article-title>
”, <source><italic>Electronic Library</italic>
</source>
, Vol. <volume>16</volume>
 No. <issue>2</issue>
, pp. <fpage>123</fpage>
<x>‐</x>
<lpage>30</lpage>
.</mixed-citation>
</ref>
<ref id="b12"><mixed-citation><person-group person-group-type="author"><string-name><surname>Mundgod</surname>
, <given-names>M.</given-names>
</string-name>
</person-group>
 (<year>1993</year>
), “<article-title><italic>Application of optical character recognition and expert system cataloguing: a state of the art report</italic>
</article-title>
”, <source><italic>Guided Project II</italic>
</source>
, <publisher-name>DRTC</publisher-name>
, <publisher-loc>Bangalore</publisher-loc>
, pp. <fpage>35</fpage>
<x>‐</x>
<lpage>62</lpage>
.</mixed-citation>
</ref>
<ref id="b13"><mixed-citation><person-group person-group-type="author"><string-name>Omnipage</string-name>
</person-group>
 (<year>2001</year>
), <source><italic>OmniPage Pro 10</italic>
</source>
, available at: <ext-link ext-link-type="uri" xlink:href="http://www.caere.com/products/omnipage/pro/">www.caere.com/products/omnipage/pro/</ext-link>
, <publisher-name>Scansoft Inc.</publisher-name>
, <publisher-loc>Peabody, MA</publisher-loc>
.</mixed-citation>
</ref>
<ref id="b14"><mixed-citation><person-group person-group-type="author"><string-name><surname>Vanhelsuwe</surname>
, <given-names>L.</given-names>
</string-name>
</person-group>
 (<year>1996</year>
), <source><italic>Mastering Java</italic>
</source>
, <publisher-name>BPB Publications</publisher-name>
, <publisher-loc>New Delhi</publisher-loc>
, pp. <fpage>360</fpage>
<x>‐</x>
<lpage>5</lpage>
.</mixed-citation>
</ref>
<ref id="b15"><mixed-citation><person-group person-group-type="author"><string-name><surname>Wyner</surname>
, <given-names>B.S.</given-names>
</string-name>
</person-group>
 (<year>1980</year>
), <source><italic>Introduction to Cataloguing and Classification</italic>
</source>
, <edition>6th ed</edition>
, <publisher-name>Libraries Unlimited</publisher-name>
, <publisher-loc>Littleton, CO</publisher-loc>
, p. <fpage>640</fpage>
.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6"><titleInfo lang="en"><title>Heuristics for identification of bibliographic elements from title pages</title>
</titleInfo>
<titleInfo type="alternative" lang="en" contentType="CDATA"><title>Heuristics for identification of bibliographic elements from title pages</title>
</titleInfo>
<name type="personal"><namePart type="given">Durga</namePart>
<namePart type="family">Sankar Rath</namePart>
<affiliation>Lecturer in the Department of Library and Information Science, Ravindra Bharati University, Kolkata, India</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">A.R.D.</namePart>
<namePart type="family">Prasad</namePart>
<affiliation>Associate Professor, Documentation Research and Training Centre, Indian Statistical Institute, Bangalore, Karnataka, India</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article"></genre>
<originInfo><publisher>Emerald Group Publishing Limited</publisher>
<dateIssued encoding="w3cdtf">2004-12-01</dateIssued>
<copyrightDate encoding="w3cdtf">2004</copyrightDate>
</originInfo>
<language><languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<physicalDescription><internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract lang="en">This paper presents a methodology for automatic identification of bibliographic data elements from the title pages of books. It also enumerates the steps involved: scanning the title pages, running optical character recognition (OCR) software, generating HTML files from the title pages and applying heuristics to identify the bibliographic data elements. Much of the paper deals with the surveys undertaken to analyze the characteristics of bibliographic descriptive elements such as title, author, publisher and other elements. The first survey deals with the sequence of the bibliographic data on the title pages. The second survey deals with the font size, font type and proximity of each bibliographic element on the title pages. The survey results are then used to develop heuristics for a rule-based expert system that can identify the bibliographic elements on the title pages. The results of the system are presented, along with the problems encountered.</abstract>
<subject><genre>Keywords</genre>
<topic>Bibliographic systems</topic>
<topic>Data handling</topic>
<topic>Cataloguing</topic>
<topic>Classification schemes</topic>
<topic>Information operations</topic>
</subject>
<relatedItem type="host"><titleInfo><title>Library Hi Tech</title>
</titleInfo>
<genre type="Journal">journal</genre>
<subject><genre>Emerald Subject Group</genre>
<topic authority="SubjectCodesPrimary" authorityURI="cat-IKM">Information &amp; knowledge management</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-ICT">Information &amp; communications technology</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-INT">Internet</topic>
</subject>
<subject><genre>Emerald Subject Group</genre>
<topic authority="SubjectCodesPrimary" authorityURI="cat-LISC">Library &amp; information science</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-IBRT">Information behaviour &amp; retrieval</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-LLM">Librarianship/library management</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-IUS">Information user studies</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-MTD">Metadata</topic>
<topic authority="SubjectCodesSecondary" authorityURI="cat-LTC">Library technology</topic>
</subject>
<identifier type="ISSN">0737-8831</identifier>
<identifier type="PublisherID">lht</identifier>
<identifier type="DOI">10.1108/lht</identifier>
<part><date>2004</date>
<detail type="volume"><caption>vol.</caption>
<number>22</number>
</detail>
<detail type="issue"><caption>no.</caption>
<number>4</number>
</detail>
<extent unit="pages"><start>389</start>
<end>396</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">444D56D27EBF7681527E9F282D508A59D2646702</identifier>
<identifier type="DOI">10.1108/07378830410570494</identifier>
<identifier type="filenameID">2380220408</identifier>
<identifier type="original-pdf">2380220408.pdf</identifier>
<identifier type="href">07378830410570494.pdf</identifier>
<accessCondition type="use and reproduction" contentType="copyright">© Emerald Group Publishing Limited</accessCondition>
<recordInfo><recordContentSource>EMERALD</recordContentSource>
</recordInfo>
</mods>
</metadata>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C65 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000C65 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:444D56D27EBF7681527E9F282D508A59D2646702
   |texte=   Heuristics for identification of bibliographic elements from title pages
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024
