Selecting a restoration technique to minimize OCR error

Internal identifier: 001773 (Main/Exploration); previous: 001772; next: 001774

Authors: M. Cannon [United States]; M. Fugate; D. R. Hush; C. Scovel

Source:

IEEE Transactions on Neural Networks [ISSN 1045-9227]; 2003.

RBID : Pascal:03-0308647

French descriptors

Pascal (Inist)
- Application, Reconnaissance optique caractère, Erreur, Estimation, Traitement optique donnée, Loi probabilité, Optimisation, Complexité calcul, Algorithme, Méthode itérative, Réseau neuronal, Démonstration théorème, Système apprentissage, Théorie, Expérience.

English descriptors

KwdEn :
- Algorithms, Application, Computational complexity, Empirical error minimization, Errors, Estimation, Experiments, Generalized multiclass ratchet, Iterative methods, Learning systems, Neural networks, Optical character recognition, Optical data processing, Optimization, Pocket algorithm, Probability distributions, Restoration technique, Theorem proving, Theory.

Abstract

This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem (see, e.g., [5, p. 266]) and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm [6]. Finally, we apply these methods to a collection of documents and report on the experimental results.
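
The nearest-neighbor idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors, the error table, the Euclidean distance, and the names `select_technique`, `FEATURES` and `ERRORS` are all assumptions made for illustration. The sketch assumes each training document has a feature vector and a measured OCR error for every restoration technique; for a new document it averages each technique's error over the k nearest training documents and returns the technique with the smallest average.

```python
import math

def select_technique(train_features, train_errors, query, k=3):
    """Return the index of the restoration technique with the lowest
    mean OCR error among the k training documents nearest to `query`.

    train_features: list of feature vectors, one per training document
                    (hypothetical features, not from the paper)
    train_errors:   list of rows; train_errors[i][j] is the OCR error of
                    document i after applying restoration technique j
    query:          feature vector of the document to be restored
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not train_features or len(train_features) != len(train_errors):
        raise ValueError("need one error row per training document")

    # Rank training documents by Euclidean distance to the query.
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(train_features[i], query))
    nearest = order[:k]

    # Average each technique's error over the selected neighbors.
    n_techniques = len(train_errors[0])
    mean_errors = [
        sum(train_errors[i][j] for i in nearest) / len(nearest)
        for j in range(n_techniques)
    ]

    # Choose the technique with the smallest estimated error.
    return min(range(n_techniques), key=mean_errors.__getitem__)


# Toy data (invented): two clusters of documents, two techniques.
FEATURES = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
ERRORS = [[0.10, 0.30], [0.12, 0.28], [0.40, 0.05], [0.35, 0.08]]

print(select_technique(FEATURES, ERRORS, [0.05, 0.0], k=2))  # technique 0
print(select_technique(FEATURES, ERRORS, [1.0, 1.0], k=2))   # technique 1
```

Averaging the observed errors over neighbors is a simple estimate of the expected error of each technique given the document, which is the quantity an optimal selection function would minimize. The choice of k and of the distance measure is left open here.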

Affiliations:

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Selecting a restoration technique to minimize OCR error</title>
<author><name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M." last="Cannon">M. Cannon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Comp. Res. and Applications Group Los Alamos National Laboratory</s1>
<s2>Los Alamos, NM 87544</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Nouveau-Mexique</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M." last="Fugate">M. Fugate</name>
</author>
<author><name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D. R." last="Hush">D. R. Hush</name>
</author>
<author><name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C." last="Scovel">C. Scovel</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">03-0308647</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 03-0308647 EI</idno>
<idno type="RBID">Pascal:03-0308647</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000611</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000180</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000537</idno>
<idno type="wicri:doubleKey">1045-9227:2003:Cannon M:selecting:a:restoration</idno>
<idno type="wicri:Area/Main/Merge">001851</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:18238033</idno>
<idno type="wicri:Area/PubMed/Corpus">000055</idno>
<idno type="wicri:Area/PubMed/Curation">000055</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000055</idno>
<idno type="wicri:Area/Ncbi/Merge">000044</idno>
<idno type="wicri:Area/Ncbi/Curation">000044</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000044</idno>
<idno type="wicri:doubleKey">1045-9227:2003:Cannon M:selecting:a:restoration</idno>
<idno type="wicri:Area/Main/Merge">001739</idno>
<idno type="wicri:Area/Main/Curation">001773</idno>
<idno type="wicri:Area/Main/Exploration">001773</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Selecting a restoration technique to minimize OCR error</title>
<author><name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M." last="Cannon">M. Cannon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Comp. Res. and Applications Group Los Alamos National Laboratory</s1>
<s2>Los Alamos, NM 87544</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Nouveau-Mexique</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M." last="Fugate">M. Fugate</name>
</author>
<author><name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D. R." last="Hush">D. R. Hush</name>
</author>
<author><name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C." last="Scovel">C. Scovel</name>
</author>
</analytic>
<series><title level="j" type="main">IEEE Transactions on Neural Networks</title>
<title level="j" type="abbreviated">IEEE Trans Neural Networks</title>
<idno type="ISSN">1045-9227</idno>
<imprint><date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">IEEE Transactions on Neural Networks</title>
<title level="j" type="abbreviated">IEEE Trans Neural Networks</title>
<idno type="ISSN">1045-9227</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Application</term>
<term>Computational complexity</term>
<term>Empirical error minimization</term>
<term>Errors</term>
<term>Estimation</term>
<term>Experiments</term>
<term>Generalized multiclass ratchet</term>
<term>Iterative methods</term>
<term>Learning systems</term>
<term>Neural networks</term>
<term>Optical character recognition</term>
<term>Optical data processing</term>
<term>Optimization</term>
<term>Pocket algorithm</term>
<term>Probability distributions</term>
<term>Restoration technique</term>
<term>Theorem proving</term>
<term>Theory</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Application</term>
<term>Reconnaissance optique caractère</term>
<term>Erreur</term>
<term>Estimation</term>
<term>Traitement optique donnée</term>
<term>Loi probabilité</term>
<term>Optimisation</term>
<term>Complexité calcul</term>
<term>Algorithme</term>
<term>Méthode itérative</term>
<term>Réseau neuronal</term>
<term>Démonstration théorème</term>
<term>Système apprentissage</term>
<term>Théorie</term>
<term>Expérience</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem (see, e.g., [5, p. 266]) and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm [6]. Finally, we apply these methods to a collection of documents and report on the experimental results.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Nouveau-Mexique</li>
</region>
</list>
<tree><noCountry><name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M." last="Fugate">M. Fugate</name>
<name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D. R." last="Hush">D. R. Hush</name>
<name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C." last="Scovel">C. Scovel</name>
</noCountry>
<country name="États-Unis"><region name="Nouveau-Mexique"><name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M." last="Cannon">M. Cannon</name>
</region>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001773 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001773 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:03-0308647
   |texte=   Selecting a restoration technique to minimize OCR error
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server

Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.