An evaluation of an automatic markup system
Internal identifier: 002D87 (Main/Merge); previous: 002D86; next: 002D88
Authors: K. Taghva [United States]; A. Condit [United States]; J. Borsack [United States]
Source:
- SPIE proceedings series [1017-2653]; 1995.
French descriptors
- Pascal (Inist): Reconnaissance optique caractère
English descriptors
- KwdEn: Optical character recognition
Abstract
One predominant application of OCR is the recognition of full-text documents for information retrieval. Modern retrieval systems exploit both the textual content of a document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, owing to the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is the extent to which OCR devices can provide reliable information for capturing the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system that automatically marks up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of our markup system will be compared with a visual examination of the hardcopy to determine its correctness.
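As an illustration of how dictionary lookup can support one of the tasks named in the abstract (de-hyphenation), the following is a minimal sketch. It is not Autotag's actual method, which this record does not describe. The function name `dehyphenate`, the word-set dictionary, and the punctuation handling are all assumptions made for the example.

```python
def dehyphenate(lines, dictionary):
    """Join words split by an end-of-line hyphen when the joined form is a known word.

    `lines` is a list of OCR text lines; `dictionary` is a set of lowercase words.
    Both the interface and the joining rule are hypothetical, for illustration only.
    """
    out = []
    carry = ""  # word fragment (with its trailing hyphen) carried over from the previous line
    for line in lines:
        words = line.split()
        if carry and words:
            joined = carry[:-1] + words[0]  # candidate word without the hyphen
            # Ignore trailing punctuation when looking the candidate up.
            if joined.lower().rstrip(".,;:!?") in dictionary:
                words[0] = joined            # a line-break hyphen: drop it
            else:
                words[0] = carry + words[0]  # a genuine hyphen: keep it
            carry = ""
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            carry = words.pop()              # defer the decision to the next line
        out.extend(words)
    if carry:
        out.append(carry)                    # dangling fragment at the end of the input
    return out


words = dehyphenate(
    ["optical recog-", "nition of well-", "known documents"],
    {"recognition"},
)
print(words)
```

A real system would presumably also use the layout and font information the abstract mentions; this sketch relies on the dictionary alone.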
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000958
- to stream PascalFrancis, to step Curation: 000A41
- to stream PascalFrancis, to step Checkpoint: 000A34
Links to Exploration step
Pascal:97-0124556
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">An evaluation of an automatic markup system</title>
<author><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">97-0124556</idno>
<date when="1995">1995</date>
<idno type="stanalyst">PASCAL 97-0124556 INIST</idno>
<idno type="RBID">Pascal:97-0124556</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000958</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000A41</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000A34</idno>
<idno type="wicri:doubleKey">1017-2653:1995:Taghva K:an:evaluation:of</idno>
<idno type="wicri:Area/Main/Merge">002D87</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">An evaluation of an automatic markup system</title>
<author><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><settlement type="city">Las Vegas</settlement>
<region type="state">Nevada</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint><date when="1995">1995</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Optical character recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">One predominant application of OCR is the recognition of full-text documents for information retrieval. Modern retrieval systems exploit both the textual content of a document and its structure. The relationship between textual content and character accuracy has been the focus of recent studies. It has been shown that, owing to the redundancies in text, average precision and recall are not heavily affected by OCR character errors. What is not fully known is the extent to which OCR devices can provide reliable information for capturing the structure of the document. In this paper, we present a preliminary report on the design and evaluation of a system that automatically marks up technical documents, based on information provided by an OCR device. The device we use differs from traditional OCR devices in that it not only performs optical character recognition but also provides detailed information about page layout, word geometry, and font usage. Our automatic markup program, which we call Autotag, uses this information, combined with dictionary lookup and content analysis, to identify structural components of the text. These include the document title, author information, abstract, sections, section titles, paragraphs, sentences, and de-hyphenated words. The output of our markup system will be compared with a visual examination of the hardcopy to determine its correctness.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Nevada</li>
</region>
<settlement><li>Las Vegas</li>
</settlement>
</list>
<tree><country name="États-Unis"><region name="Nevada"><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
</region>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002D87 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 002D87 | SxmlIndent | more
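Outside the Dilib toolchain, a record of this shape can also be read with a general-purpose XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`. It assumes the record text is already available as a string. Because the record uses the `wicri:` and `inist:` prefixes without declaring them, the example binds them to placeholder URIs. Those URIs and the helper names `parse_record` and `summarize` are hypothetical and not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the record does not declare
# the wicri: and inist: prefixes, so a strict parser would reject it as-is.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(xml_text):
    """Parse a <record> after injecting the placeholder namespace declarations."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    patched = xml_text.replace("<record>", f"<record {decls}>", 1)
    return ET.fromstring(patched)


def summarize(root):
    """Extract the title, author names, and year from the TEI header."""
    return {
        "title": root.findtext(".//titleStmt/title"),
        "authors": [n.text for n in root.findall(".//titleStmt/author/name")],
        "year": root.findtext(".//publicationStmt/date"),
    }


# Abridged sample with the same element structure as the record above.
sample = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">An evaluation of an automatic markup system</title>
<author><name>K. Taghva</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s2>Las Vegas</s2></inist:fA14></affiliation>
</author>
</titleStmt>
<publicationStmt><date when="1995">1995</date></publicationStmt>
</fileDesc></teiHeader></TEI></record>"""

print(summarize(parse_record(sample)))
```

Restricting the author path to `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.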
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= Pascal:97-0124556 |texte= An evaluation of an automatic markup system }}
This area was generated with Dilib version V0.6.32.