OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Automatic error correction and query evaluation of OCR generated text

Internal identifier: 003356 (Main/Merge); previous: 003355; next: 003357

Authors: K. Taghva [United States]; J. Borsack; A. Condit

RBID: Pascal:94-0099345

French descriptors

English descriptors

Abstract

The method used in our error correction system is based on three principles: 1) approximate string matching between the misrecognized words and the terms occurring in the database, as opposed to the entire dictionary; 2) local information obtained from the individual documents; 3) the use of a confusion matrix, which contains information inherently specific to the nature of errors caused by the particular OCR device. This system is utilized to process a database composed of approximately 9300 pages of OCR-generated documents.
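To make the first and third principles concrete, here is a minimal Python sketch of approximate matching against the collection's own terms, with substitution costs taken from a confusion table. It is an illustration only, not the authors' implementation: the names `CONFUSION_COST`, `weighted_edit_distance` and `best_correction`, the cost values, and the `max_cost` threshold are all assumptions, and the second principle (local information from individual documents) is not modelled.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical confusion table: (intended char, char the OCR produced) -> cost.
# Values are invented for illustration; a real table would be estimated
# from the errors of one particular OCR device.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("l", "1"): 0.2,
    ("o", "0"): 0.2,
    ("e", "c"): 0.4,
}


def weighted_edit_distance(
    candidate: str, observed: str, confusion: Dict[Tuple[str, str], float]
) -> float:
    """Edit distance from a candidate term to the observed OCR string.

    Insertions and deletions cost 1.0; a substitution costs the value in
    the confusion table, or 1.0 when the pair is not listed.
    """
    prev = [float(j) for j in range(len(observed) + 1)]
    for i, c in enumerate(candidate, start=1):
        curr = [float(i)]
        for j, o in enumerate(observed, start=1):
            sub = 0.0 if c == o else confusion.get((c, o), 1.0)
            curr.append(min(prev[j] + 1.0,        # drop a candidate char
                            curr[j - 1] + 1.0,    # extra char in OCR output
                            prev[j - 1] + sub))   # match or substitution
        prev = curr
    return prev[-1]


def best_correction(
    observed: str,
    collection_terms: Iterable[str],
    confusion: Dict[Tuple[str, str], float],
    max_cost: float = 1.0,
) -> Optional[str]:
    """Return the closest term from the collection, or None if none is close.

    Candidates come from the terms that occur in the collection itself,
    not from a full dictionary. Ties are broken alphabetically.
    """
    scored = [(weighted_edit_distance(t, observed, confusion), t)
              for t in collection_terms]
    if not scored:
        return None
    cost, term = min(scored)
    return term if cost <= max_cost else None


terms = {"retrieval", "recognition", "correction"}
print(best_correction("retrieva1", terms, CONFUSION_COST))
```

Restricting candidates to terms already present in the collection keeps the search space small and biases corrections toward the collection's vocabulary; the threshold leaves strings with no close candidate uncorrected.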

Links to Exploration step

Pascal:94-0099345

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Automatic error correction and query evaluation of OCR generated text</title>
<author>
<name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Univ. Nevada, information sci. res. inst.</s1>
<s2>Las Vegas NV 89154-4021</s2>
<s3>USA</s3>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas NV 89154-4021</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
</author>
<author>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">94-0099345</idno>
<date when="1993">1993</date>
<idno type="stanalyst">PASCAL 94-0099345 INIST</idno>
<idno type="RBID">Pascal:94-0099345</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000B12</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000888</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000B05</idno>
<idno type="wicri:Area/Main/Merge">003356</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Automatic error correction and query evaluation of OCR generated text</title>
<author>
<name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Univ. Nevada, information sci. res. inst.</s1>
<s2>Las Vegas NV 89154-4021</s2>
<s3>USA</s3>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas NV 89154-4021</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
</author>
<author>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automation</term>
<term>Error correction</term>
<term>Evaluation</term>
<term>Experience</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Optical reading</term>
<term>Result</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Lecture optique</term>
<term>Recherche information</term>
<term>Correction erreur</term>
<term>Automatisation</term>
<term>Evaluation</term>
<term>Expérience</term>
<term>Résultat</term>
<term>Reconnaissance optique caractère</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Automatisation</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The method used in our error correction system is based on three principles: 1) approximate string matching between the misrecognized words and the terms occurring in the database as opposed to the entire dictionary 2) local information obtained from the individual documents 3) the use of a confusion matrix, which contains information inherently specific to the nature of errors caused by the particular OCR device. This system is utilized to process a database composed of approximately 9300 pages of OCR generated documents.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
</list>
<tree>
<noCountry>
<name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</noCountry>
<country name="États-Unis">
<noRegion>
<name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 003356 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 003356 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     Pascal:94-0099345
   |texte=   Automatic error correction and query evaluation of OCR generated text
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024