Automatic error correction and query evaluation of OCR generated text
Internal identifier: 003356 (Main/Merge); previous: 003355; next: 003357
Authors: K. Taghva [United States]; J. Borsack; A. Condit
Source:
French descriptors
- Pascal (Inist)
- Wicri :
- topic : Automatisation.
English descriptors
- KwdEn :
Abstract
The method used in our error correction system is based on three principles: 1) approximate string matching between the misrecognized words and the terms occurring in the database, as opposed to the entire dictionary; 2) local information obtained from the individual documents; 3) the use of a confusion matrix, which contains information specific to the nature of the errors caused by the particular OCR device. This system is used to process a database of approximately 9300 pages of OCR-generated documents.
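As an illustration only, two of the principles named in the abstract (matching against the terms of the database rather than a full dictionary, and a device-specific confusion matrix) can be sketched as a weighted edit distance. This is not the authors' implementation: the names CONFUSION_COST, weighted_distance and correct, the cost values, and the max_cost threshold are all assumptions made for this sketch, and the second principle (local information from individual documents) is left out because the abstract does not describe it.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical confusion costs: (OCR output char, intended char) -> cost.
# Real values would be estimated from the errors of one particular OCR device.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("1", "l"): 0.2,
    ("0", "o"): 0.2,
    ("c", "e"): 0.4,
}


def substitution_cost(observed: str, intended: str) -> float:
    """Cost of reading `intended` as `observed`; known confusions are cheap."""
    if observed == intended:
        return 0.0
    return CONFUSION_COST.get((observed, intended), 1.0)


def weighted_distance(observed: str, candidate: str) -> float:
    """Edit distance in which likely OCR confusions cost less than 1."""
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, oc in enumerate(observed, start=1):
        curr = [float(i)]
        for j, cc in enumerate(candidate, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # drop a spurious character
                curr[j - 1] + 1.0,                        # restore a missing character
                prev[j - 1] + substitution_cost(oc, cc),  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def correct(word: str, vocabulary: Iterable[str], max_cost: float = 1.0) -> Optional[str]:
    """Return the closest database term, or None if nothing is close enough."""
    terms = set(vocabulary)
    if word in terms:
        return word  # already a known term, nothing to correct
    best = min(terms, key=lambda term: weighted_distance(word, term), default=None)
    if best is not None and weighted_distance(word, best) <= max_cost:
        return best
    return None


# The vocabulary stands in for the terms occurring in the document database.
database_terms = {"retrieval", "optical", "error", "correction"}
suggestion = correct("retrieva1", database_terms)
```

Restricting candidates to database_terms mirrors the first principle: a misrecognized word is only ever mapped to a term that actually occurs in the collection.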
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000B12
- to stream PascalFrancis, to step Curation: 000888
- to stream PascalFrancis, to step Checkpoint: 000B05
Links to Exploration step
Pascal:94-0099345
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Automatic error correction and query evaluation of OCR generated text</title>
<author><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Univ. Nevada, information sci. res. inst.</s1>
<s2>Las Vegas NV 89154-4021</s2>
<s3>USA</s3>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas NV 89154-4021</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
</author>
<author><name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">94-0099345</idno>
<date when="1993">1993</date>
<idno type="stanalyst">PASCAL 94-0099345 INIST</idno>
<idno type="RBID">Pascal:94-0099345</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000B12</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000888</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000B05</idno>
<idno type="wicri:Area/Main/Merge">003356</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Automatic error correction and query evaluation of OCR generated text</title>
<author><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Univ. Nevada, information sci. res. inst.</s1>
<s2>Las Vegas NV 89154-4021</s2>
<s3>USA</s3>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas NV 89154-4021</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
</author>
<author><name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automation</term>
<term>Error correction</term>
<term>Evaluation</term>
<term>Experience</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Optical reading</term>
<term>Result</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Lecture optique</term>
<term>Recherche information</term>
<term>Correction erreur</term>
<term>Automatisation</term>
<term>Evaluation</term>
<term>Expérience</term>
<term>Résultat</term>
<term>Reconnaissance optique caractère</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Automatisation</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The method used in our error correction system is based on three principles: 1) approximate string matching between the misrecognized words and the terms occurring in the database as opposed to the entire dictionary 2) local information obtained from the individual documents 3) the use of a confusion matrix, which contains information inherently specific to the nature of errors caused by the particular OCR device. This system is utilized to process a database composed of approximately 9300 pages of OCR generated documents.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
</list>
<tree><noCountry><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
</noCountry>
<country name="États-Unis"><noRegion><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 003356 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 003356 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= Pascal:94-0099345 |texte= Automatic error correction and query evaluation of OCR generated text }}
This area was generated with Dilib version V0.6.32.