Selecting a restoration technique to minimize OCR error
Internal identifier: 001773 (Main/Exploration); previous: 001772; next: 001774
Authors: M. Cannon [United States]; M. Fugate; D. R. Hush; C. Scovel
Source:
- IEEE Transactions on Neural Networks [1045-9227]; 2003.
French descriptors
- Pascal (Inist)
English descriptors
- KwdEn:
- Algorithms, Application, Computational complexity, Empirical error minimization, Errors, Estimation, Experiments, Generalized multiclass ratchet, Iterative methods, Learning systems, Neural networks, Optical character recognition, Optical data processing, Optimization, Pocket algorithm, Probability distributions, Restoration technique, Theorem proving, Theory.
Abstract
This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem (see, e.g., [5, p. 266]) and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm [6]. Finally, we apply these methods to a collection of documents and report on the experimental results.
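The nearest-neighbor idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method, whose details the abstract does not give. It assumes each training document is described by a numeric feature vector and that the OCR error of every restoration technique has been measured on it. The function name `select_technique`, the Euclidean distance, the value of `k`, and the toy data are all hypothetical.

```python
import math

def select_technique(query, train_features, train_errors, k=3):
    """Pick the restoration technique with the lowest mean OCR error among
    the k training documents closest to `query` (illustrative sketch)."""
    # Rank training documents by Euclidean distance to the query features.
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(query, train_features[i]))
    neighbors = order[:k]
    n_techniques = len(train_errors[0])
    # Average each technique's observed OCR error over the neighbors.
    mean_error = [sum(train_errors[i][t] for i in neighbors) / len(neighbors)
                  for t in range(n_techniques)]
    # Return the index of the technique with the smallest estimated error.
    return min(range(n_techniques), key=mean_error.__getitem__)

# Hypothetical toy data: 2-D document features and, for each document,
# the OCR error observed under two restoration techniques.
features = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
errors = [(0.05, 0.20), (0.04, 0.25), (0.30, 0.10), (0.35, 0.08)]

choice_a = select_technique((0.05, 0.1), features, errors, k=2)
choice_b = select_technique((1.0, 0.9), features, errors, k=2)
```

In this toy setup a document near the first cluster is assigned technique 0 and one near the second cluster is assigned technique 1, because those techniques have the lower average error among the respective neighbors.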
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000611
- to stream PascalFrancis, to step Curation: 000180
- to stream PascalFrancis, to step Checkpoint: 000537
- to stream Main, to step Merge: 001851
- to stream PubMed, to step Corpus: 000055
- to stream PubMed, to step Curation: 000055
- to stream PubMed, to step Checkpoint: 000055
- to stream Ncbi, to step Merge: 000044
- to stream Ncbi, to step Curation: 000044
- to stream Ncbi, to step Checkpoint: 000044
- to stream Main, to step Merge: 001739
- to stream Main, to step Curation: 001773
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Selecting a restoration technique to minimize OCR error</title>
<author><name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M." last="Cannon">M. Cannon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Comp. Res. and Applications Group Los Alamos National Laboratory</s1>
<s2>Los Alamos, NM 87544</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Nouveau-Mexique</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M." last="Fugate">M. Fugate</name>
</author>
<author><name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D. R." last="Hush">D. R. Hush</name>
</author>
<author><name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C." last="Scovel">C. Scovel</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">03-0308647</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 03-0308647 EI</idno>
<idno type="RBID">Pascal:03-0308647</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000611</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000180</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000537</idno>
<idno type="wicri:doubleKey">1045-9227:2003:Cannon M:selecting:a:restoration</idno>
<idno type="wicri:Area/Main/Merge">001851</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:18238033</idno>
<idno type="wicri:Area/PubMed/Corpus">000055</idno>
<idno type="wicri:Area/PubMed/Curation">000055</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000055</idno>
<idno type="wicri:Area/Ncbi/Merge">000044</idno>
<idno type="wicri:Area/Ncbi/Curation">000044</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000044</idno>
<idno type="wicri:doubleKey">1045-9227:2003:Cannon M:selecting:a:restoration</idno>
<idno type="wicri:Area/Main/Merge">001739</idno>
<idno type="wicri:Area/Main/Curation">001773</idno>
<idno type="wicri:Area/Main/Exploration">001773</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Selecting a restoration technique to minimize OCR error</title>
<author><name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M." last="Cannon">M. Cannon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Comp. Res. and Applications Group Los Alamos National Laboratory</s1>
<s2>Los Alamos, NM 87544</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Nouveau-Mexique</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M." last="Fugate">M. Fugate</name>
</author>
<author><name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D. R." last="Hush">D. R. Hush</name>
</author>
<author><name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C." last="Scovel">C. Scovel</name>
</author>
</analytic>
<series><title level="j" type="main">IEEE Transactions on Neural Networks</title>
<title level="j" type="abbreviated">IEEE Trans Neural Networks</title>
<idno type="ISSN">1045-9227</idno>
<imprint><date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">IEEE Transactions on Neural Networks</title>
<title level="j" type="abbreviated">IEEE Trans Neural Networks</title>
<idno type="ISSN">1045-9227</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Application</term>
<term>Computational complexity</term>
<term>Empirical error minimization</term>
<term>Errors</term>
<term>Estimation</term>
<term>Experiments</term>
<term>Generalized multiclass ratchet</term>
<term>Iterative methods</term>
<term>Learning systems</term>
<term>Neural networks</term>
<term>Optical character recognition</term>
<term>Optical data processing</term>
<term>Optimization</term>
<term>Pocket algorithm</term>
<term>Probability distributions</term>
<term>Restoration technique</term>
<term>Theorem proving</term>
<term>Theory</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Application</term>
<term>Reconnaissance optique caractère</term>
<term>Erreur</term>
<term>Estimation</term>
<term>Traitement optique donnée</term>
<term>Loi probabilité</term>
<term>Optimisation</term>
<term>Complexité calcul</term>
<term>Algorithme</term>
<term>Méthode itérative</term>
<term>Réseau neuronal</term>
<term>Démonstration théorème</term>
<term>Système apprentissage</term>
<term>Théorie</term>
<term>Expérience</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem (see, e.g., °5, p. 266) and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm °6. Finally, we apply these methods to a collection of documents and report on the experimental results.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Nouveau-Mexique</li>
</region>
</list>
<tree><noCountry><name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M." last="Fugate">M. Fugate</name>
<name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D. R." last="Hush">D. R. Hush</name>
<name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C." last="Scovel">C. Scovel</name>
</noCountry>
<country name="États-Unis"><region name="Nouveau-Mexique"><name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M." last="Cannon">M. Cannon</name>
</region>
</country>
</tree>
</affiliations>
</record>
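As a rough illustration of how a record like the one above could be read programmatically, the sketch below extracts the title, authors, and English keywords with Python's standard `xml.etree.ElementTree`. It assumes the record is available as a string. Because the record uses the `wicri:` and `inist:` prefixes without declaring them, the sketch wraps it in a hypothetical `wrapper` element that binds those prefixes to placeholder URIs. The URIs, the `summarize_record` helper, and the trimmed sample are assumptions made for illustration and are not part of Dilib.

```python
import xml.etree.ElementTree as ET

def summarize_record(record_xml):
    """Return title, authors, and English keywords from a record string."""
    # Bind the undeclared wicri: and inist: prefixes to placeholder URIs
    # (assumed values) so that the XML parser accepts the record.
    wrapped = ('<wrapper xmlns:wicri="urn:example:wicri" '
               'xmlns:inist="urn:example:inist">' + record_xml + '</wrapper>')
    root = ET.fromstring(wrapped)
    title = root.findtext(".//titleStmt/title")
    authors = [n.text for n in root.findall(".//titleStmt/author/name")]
    keywords = [t.text
                for t in root.findall(".//keywords[@scheme='KwdEn']/term")]
    return {"title": title, "authors": authors, "keywords": keywords}

# Trimmed, hypothetical sample that mirrors the structure of the record.
sample = (
    '<record><TEI><teiHeader><fileDesc><titleStmt>'
    '<title xml:lang="en" level="a">Selecting a restoration technique '
    'to minimize OCR error</title>'
    '<author><name first="M." last="Cannon">M. Cannon</name>'
    '<affiliation wicri:level="2"><inist:fA14 i1="01"><s3>USA</s3>'
    '</inist:fA14></affiliation></author>'
    '<author><name first="M." last="Fugate">M. Fugate</name></author>'
    '</titleStmt></fileDesc>'
    '<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en">'
    '<term>Optical character recognition</term>'
    '<term>Pocket algorithm</term>'
    '</keywords></textClass></profileDesc></teiHeader></TEI></record>'
)

summary = summarize_record(sample)
```

Restricting the author lookup to `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.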
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001773 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 001773 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:03-0308647 |texte= Selecting a restoration technique to minimize OCR error }}
This area was generated with Dilib version V0.6.32.