OCR exploration server

Selecting a restoration technique to minimize OCR error.

Internal identifier: 000055 (PubMed/Corpus); previous: 000054; next: 000056

Authors: M. Cannon; M. Fugate; D. R. Hush; C. Scovel

Source: IEEE Transactions on Neural Networks, vol. 14, no. 3, pp. 478-90, 2003

RBID : pubmed:18238033

Abstract

This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm. Finally, we apply these methods to a collection of documents and report on the experimental results.
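
The nearest-neighbor idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the Euclidean distance, the value of `k`, the function name `select_technique`, and the toy data are all assumptions made for illustration. The sketch estimates the expected OCR error of each restoration technique from the `k` most similar training documents and picks the technique with the smallest estimate.

```python
import math

# Hypothetical training set: one feature vector per document, and for each
# document the OCR error observed after applying each restoration technique.
TRAIN_FEATURES = [(0.10, 0.20), (0.15, 0.25), (0.90, 0.80), (0.85, 0.90)]
TRAIN_ERRORS = [(0.30, 0.10), (0.28, 0.12), (0.05, 0.20), (0.07, 0.25)]


def select_technique(features, train_features, train_errors, k=3):
    """Return (best technique index, per-technique mean error estimates)."""
    if not train_features:
        raise ValueError("training set is empty")
    # Rank training documents by Euclidean distance to the query document.
    order = sorted(
        range(len(train_features)),
        key=lambda i: math.dist(features, train_features[i]),
    )
    neighbors = order[:k]
    n_techniques = len(train_errors[0])
    # Estimate each technique's expected OCR error from the neighbors only.
    mean_error = [
        sum(train_errors[i][t] for i in neighbors) / len(neighbors)
        for t in range(n_techniques)
    ]
    # Choose the technique with the smallest estimated error.
    best = min(range(n_techniques), key=mean_error.__getitem__)
    return best, mean_error


if __name__ == "__main__":
    best, estimates = select_technique(
        (0.12, 0.22), TRAIN_FEATURES, TRAIN_ERRORS, k=2
    )
    print(best, estimates)
```

With this toy data, a query document near the first two training documents is routed to technique 1, while one near the last two is routed to technique 0. The rule depends only on the per-technique error measured on training documents, so no class label is needed.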

DOI: 10.1109/TNN.2003.811711
PubMed: 18238033

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Selecting a restoration technique to minimize OCR error.</title>
<author>
<name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M" last="Cannon">M. Cannon</name>
<affiliation>
<nlm:affiliation>Comput. Res. and Applications Group, Los Alamos Nat. Lab., NM, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M" last="Fugate">M. Fugate</name>
</author>
<author>
<name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D R" last="Hush">D R Hush</name>
</author>
<author>
<name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C" last="Scovel">C. Scovel</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2003">2003</date>
<idno type="doi">10.1109/TNN.2003.811711</idno>
<idno type="RBID">pubmed:18238033</idno>
<idno type="pmid">18238033</idno>
<idno type="wicri:Area/PubMed/Corpus">000055</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Selecting a restoration technique to minimize OCR error.</title>
<author>
<name sortKey="Cannon, M" sort="Cannon, M" uniqKey="Cannon M" first="M" last="Cannon">M. Cannon</name>
<affiliation>
<nlm:affiliation>Comput. Res. and Applications Group, Los Alamos Nat. Lab., NM, USA.</nlm:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Fugate, M" sort="Fugate, M" uniqKey="Fugate M" first="M" last="Fugate">M. Fugate</name>
</author>
<author>
<name sortKey="Hush, D R" sort="Hush, D R" uniqKey="Hush D" first="D R" last="Hush">D R Hush</name>
</author>
<author>
<name sortKey="Scovel, C" sort="Scovel, C" uniqKey="Scovel C" first="C" last="Scovel">C. Scovel</name>
</author>
</analytic>
<series>
<title level="j">IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council</title>
<idno type="ISSN">1045-9227</idno>
<imprint>
<date when="2003" type="published">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm . Finally, we apply these methods to a collection of documents and report on the experimental results.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Owner="NLM" Status="PubMed-not-MEDLINE">
<PMID Version="1">18238033</PMID>
<DateCreated>
<Year>2008</Year>
<Month>02</Month>
<Day>01</Day>
</DateCreated>
<DateCompleted>
<Year>2012</Year>
<Month>10</Month>
<Day>02</Day>
</DateCompleted>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">1045-9227</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>14</Volume>
<Issue>3</Issue>
<PubDate>
<Year>2003</Year>
</PubDate>
</JournalIssue>
<Title>IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council</Title>
<ISOAbbreviation>IEEE Trans Neural Netw</ISOAbbreviation>
</Journal>
<ArticleTitle>Selecting a restoration technique to minimize OCR error.</ArticleTitle>
<Pagination>
<MedlinePgn>478-90</MedlinePgn>
</Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1109/TNN.2003.811711</ELocationID>
<Abstract>
<AbstractText>This paper introduces a learning problem related to the task of converting printed documents to ASCII text files. The goal of the learning procedure is to produce a function that maps documents to restoration techniques in such a way that on average the restored documents have minimum optical character recognition error. We derive a general form for the optimal function and use it to motivate the development of a nonparametric method based on nearest neighbors. We also develop a direct method of solution based on empirical error minimization for which we prove a finite sample bound on estimation error that is independent of distribution. We show that this empirical error minimization problem is an extension of the empirical optimization problem for traditional M-class classification with general loss function and prove computational hardness for this problem. We then derive a simple iterative algorithm called generalized multiclass ratchet (GMR) and prove that it produces an optimal function asymptotically (with probability 1). To obtain the GMR algorithm we introduce a new data map that extends Kesler's construction for the multiclass problem and then apply an algorithm called Ratchet to this mapped data, where Ratchet is a modification of the Pocket algorithm . Finally, we apply these methods to a collection of documents and report on the experimental results.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Cannon</LastName>
<ForeName>M</ForeName>
<Initials>M</Initials>
<AffiliationInfo>
<Affiliation>Comput. Res. and Applications Group, Los Alamos Nat. Lab., NM, USA.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Fugate</LastName>
<ForeName>M</ForeName>
<Initials>M</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Hush</LastName>
<ForeName>D R</ForeName>
<Initials>DR</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Scovel</LastName>
<ForeName>C</ForeName>
<Initials>C</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>IEEE Trans Neural Netw</MedlineTA>
<NlmUniqueID>101211035</NlmUniqueID>
<ISSNLinking>1045-9227</ISSNLinking>
</MedlineJournalInfo>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2008</Year>
<Month>2</Month>
<Day>2</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2008</Year>
<Month>2</Month>
<Day>2</Day>
<Hour>9</Hour>
<Minute>1</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>2008</Year>
<Month>2</Month>
<Day>2</Day>
<Hour>9</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="doi">10.1109/TNN.2003.811711</ArticleId>
<ArticleId IdType="pubmed">18238033</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>
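
As a rough illustration of how the `<pubmed>` part of the record above could be read programmatically, the sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of that element. The helper name `summarize` and the trimmed fragment are assumptions made for the example. Note that the TEI part of the record uses the `nlm:` prefix without a visible namespace declaration, so a strict parser may reject the full `<record>`; the sketch therefore handles the `<pubmed>` element only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <pubmed> element from the record (illustration only).
PUBMED_XML = """<pubmed>
<MedlineCitation Owner="NLM" Status="PubMed-not-MEDLINE">
<PMID Version="1">18238033</PMID>
<Article PubModel="Print">
<Journal>
<JournalIssue CitedMedium="Print">
<Volume>14</Volume>
<Issue>3</Issue>
<PubDate><Year>2003</Year></PubDate>
</JournalIssue>
<ISOAbbreviation>IEEE Trans Neural Netw</ISOAbbreviation>
</Journal>
<ArticleTitle>Selecting a restoration technique to minimize OCR error.</ArticleTitle>
<Pagination><MedlinePgn>478-90</MedlinePgn></Pagination>
<ELocationID EIdType="doi" ValidYN="Y">10.1109/TNN.2003.811711</ELocationID>
</Article>
</MedlineCitation>
</pubmed>"""


def summarize(xml_text):
    """Extract the main bibliographic fields from a <pubmed> element."""
    root = ET.fromstring(xml_text)
    article = root.find("MedlineCitation/Article")
    issue = article.find("Journal/JournalIssue")
    return {
        "pmid": root.findtext("MedlineCitation/PMID"),
        "title": article.findtext("ArticleTitle"),
        "journal": article.findtext("Journal/ISOAbbreviation"),
        "year": issue.findtext("PubDate/Year"),
        "volume": issue.findtext("Volume"),
        "issue": issue.findtext("Issue"),
        "pages": article.findtext("Pagination/MedlinePgn"),
        # Select the ELocationID whose EIdType attribute is "doi".
        "doi": article.findtext("ELocationID[@EIdType='doi']"),
    }


if __name__ == "__main__":
    for field, value in summarize(PUBMED_XML).items():
        print(f"{field}: {value}")
```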

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PubMed/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000055 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd -nk 000055 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PubMed
   |étape=   Corpus
   |type=    RBID
   |clé=     pubmed:18238033
   |texte=   Selecting a restoration technique to minimize OCR error.
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Corpus/RBID.i   -Sk "pubmed:18238033" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024