OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

An evaluation of an automatic markup system

Internal identifier: 000A34 (PascalFrancis/Checkpoint); previous: 000A33; next: 000A35

Authors: K. Taghva [United States]; A. Condit [United States]; J. Borsack [United States]

Source: SPIE proceedings series (ISSN 1017-2653), 1995

RBID : Pascal:97-0124556

French descriptors

Reconnaissance optique caractère

English descriptors

Optical character recognition

Abstract

One predominant application of OCR is the recognition of full text documents for information retrieval. Modern retrieval systems exploit both the textual content of the document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, due to the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is to what extent OCR devices can provide reliable information that can be used to capture the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system to automatically mark up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition, but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of our markup system will be compared with a visual examination of the hardcopy to determine its correctness.
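
The abstract lists de-hyphenated words among the components recovered with the help of dictionary lookup. As an illustration only, the following Python sketch shows one simple way such a lookup could decide whether a line-end hyphen should be dropped. It is not Autotag's actual algorithm, which the abstract does not describe; the names `DICTIONARY`, `LINE_BREAK_HYPHEN`, and `dehyphenate` are hypothetical, and the tiny word set stands in for a real word list.

```python
import re

# Hypothetical toy dictionary; a real system would load a full word list.
DICTIONARY = {"recognition", "document", "markup", "retrieval"}

# A word fragment ending in a hyphen at a line break, followed by the
# fragment that starts the next line.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\n(\w+)")


def dehyphenate(text, dictionary=DICTIONARY):
    """Join words split across line breaks when the joined form is a known word."""

    def join(match):
        head, tail = match.group(1), match.group(2)
        joined = head + tail
        if joined.lower() in dictionary:
            # The hyphen was only a line-break artifact: drop it.
            return joined
        # Otherwise assume a genuine hyphen: keep it, drop only the line break.
        return f"{head}-{tail}"

    return LINE_BREAK_HYPHEN.sub(join, text)
```

With this toy dictionary, `dehyphenate("optical character recog-\nnition")` joins the two fragments into "recognition", while a compound such as "well-\nknown" keeps its hyphen because "wellknown" is not in the word list. A production system would also need to handle OCR character errors in the fragments, which this sketch ignores.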


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">An evaluation of an automatic markup system</title>
<author>
<name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">97-0124556</idno>
<date when="1995">1995</date>
<idno type="stanalyst">PASCAL 97-0124556 INIST</idno>
<idno type="RBID">Pascal:97-0124556</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000958</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000A41</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000A34</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">An evaluation of an automatic markup system</title>
<author>
<name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="1995">1995</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Optical character recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance optique caractère</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">One predominant application of OCR is the recognition of full text documents for information retrieval. Modern retrieval systems exploit both the textual content of the document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, due to the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is to what extent OCR devices can provide reliable information that can be used to capture the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system to automatically mark up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition, but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of our markup system will be compared with a visual examination of the hardcopy to determine its correctness.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>1017-2653</s0>
</fA01>
<fA05>
<s2>2422</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>An evaluation of an automatic markup system</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Document recognition II : San Jose CA, 6-7 February 1995</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>TAGHVA (K.)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>CONDIT (A.)</s1>
</fA11>
<fA11 i1="03" i2="1">
<s1>BORSACK (J.)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>VINCENT (Luc M.)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>BAIRD (Henry S.)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>International Society for Optical Engineering</s1>
<s2>Bellingham WA</s2>
<s3>USA</s3>
<s9>patr.</s9>
</fA18>
<fA18 i1="02" i2="1">
<s1>Society for Imaging Science and Technology</s1>
<s2>Springfield VA</s2>
<s3>USA</s3>
<s9>patr.</s9>
</fA18>
<fA20>
<s1>317-327</s1>
</fA20>
<fA21>
<s1>1995</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21760</s2>
<s5>354000053416650300</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 1997 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>19 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>97-0124556</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>SPIE proceedings series</s0>
</fA64>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>One predominant application of OCR is the recognition of full text documents for information retrieval. Modern retrieval systems exploit both the textual content of the document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, due to the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is to what extent OCR devices can provide reliable information that can be used to capture the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system to automatically mark up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition, but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of our markup system will be compared with a visual examination of the hardcopy to determine its correctness.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001A01G02A</s0>
</fC02>
<fC02 i1="02" i2="X">
<s0>205</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Reconocimiento óptico de caracteres</s0>
<s5>01</s5>
</fC03>
<fN21>
<s1>048</s1>
</fN21>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Document recognition. Conference</s1>
<s3>San Jose CA USA</s3>
<s4>1995-02-06</s4>
</fA30>
</pR>
</standard>
</inist>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Nevada</li>
</region>
<settlement>
<li>Las Vegas</li>
</settlement>
</list>
<tree>
<country name="États-Unis">
<region name="Nevada">
<name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
</region>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A34 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Checkpoint/biblio.hfd -nk 000A34 | SxmlIndent | more
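
Once a record like the one shown above has been saved as text, it can also be read outside Dilib. The following Python sketch is one possible approach, not part of Dilib: it assumes the input is a single `<record>` element with no XML declaration. Because the record uses the `wicri:` and `inist:` prefixes without declaring them, a strict parser would reject it as is, so the sketch wraps it in an element that binds both prefixes to placeholder URIs. The URIs and the names `WRAPPER`, `parse_record`, and `summarize` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): they only serve to bind the
# "wicri:" and "inist:" prefixes so that the parser accepts the record.
WRAPPER = (
    '<wrapper xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist">'
    "{}</wrapper>"
)


def parse_record(record_xml):
    """Parse one <record> element given as a string and return it as an Element."""
    return ET.fromstring(WRAPPER.format(record_xml)).find("record")


def summarize(record):
    """Return the title, author names, and year found in the TEI header."""
    file_desc = record.find("TEI/teiHeader/fileDesc")
    return {
        "title": file_desc.findtext("titleStmt/title"),
        "authors": [name.text for name in file_desc.findall("titleStmt/author/name")],
        "year": file_desc.find("publicationStmt/date").get("when"),
    }
```

For example, if the record were saved to a file (the name `record.xml` is hypothetical), `summarize(parse_record(open("record.xml", encoding="utf-8").read()))` should return a dictionary with the title, the three author names, and the year taken from the `when` attribute of the `<date>` element.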

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Pascal:97-0124556
   |texte=   An evaluation of an automatic markup system
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024