High-performance OCR preclassification trees

Internal identifier: 002C01 (Main/Exploration); previous: 002C00; next: 002C02

Authors: H. S. Baird [United States]; C. L. Mallows [United States]

Source:

SPIE proceedings series [1017-2653]; 1995.

RBID : Pascal:97-0126106

French descriptors

Pascal (Inist)
- Reconnaissance optique caractère, Arbre décision, Classification, Document, Reconnaissance forme, Caractère imprimé, Reconnaissance caractère.
Wicri:
- topic: Classification, Document.

English descriptors

KwdEn:
- Character recognition, Classification, Decision tree, Document, Optical character recognition, Pattern recognition, Printed character.

Abstract

We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error: this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree: the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.
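
The abstract cites Good's Theorem [Good53] but does not spell out the leaf-selection rule. As background only, the following minimal Python sketch shows the Good-Turing estimate usually associated with that reference: the probability that the next sample is of a kind not yet seen is estimated by the number of kinds seen exactly once divided by the total number of samples. Treating each (leaf, class) pair recorded during population as a "kind" is an assumption made here for illustration, and the function name `good_turing_missing_mass` and the sample data are hypothetical; this is not the authors' stopping rule.

```python
from collections import Counter
from typing import Hashable, Iterable


def good_turing_missing_mass(observations: Iterable[Hashable]) -> float:
    """Estimate the probability that the next observation is of an unseen kind.

    Good-Turing estimate: (kinds seen exactly once) / (total observations).
    """
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 1.0  # nothing observed yet, so every kind is still unseen
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total


# Hypothetical population log: one (leaf_id, class_label) pair per sample
# dropped down the tree. The values are illustrative, not from the paper.
population = [
    (0, "a"), (0, "a"), (0, "o"),
    (1, "b"), (1, "b"), (1, "b"),
    (2, "c"), (2, "e"),
]

# Rough estimate of the chance that a new sample lands in a leaf
# that has never recorded its class.
estimate = good_turing_missing_mass(population)
print(estimate)
```

With three pairs seen exactly once among eight samples, the estimate is 3/8 = 0.375.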

Affiliations:

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000957
to stream PascalFrancis, to step Curation: 000A42
to stream PascalFrancis, to step Checkpoint: 000A09
to stream Main, to step Merge: 002D62
to stream Main, to step Curation: 002C01

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">High-performance OCR preclassification trees</title>
<author><name sortKey="Baird, H S" sort="Baird, H S" uniqKey="Baird H" first="H. S." last="Baird">H. S. Baird</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">New Jersey</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Mallows, C L" sort="Mallows, C L" uniqKey="Mallows C" first="C. L." last="Mallows">C. L. Mallows</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">New Jersey</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">97-0126106</idno>
<date when="1995">1995</date>
<idno type="stanalyst">PASCAL 97-0126106 INIST</idno>
<idno type="RBID">Pascal:97-0126106</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000957</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000A42</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000A09</idno>
<idno type="wicri:doubleKey">1017-2653:1995:Baird H:high:performance:ocr</idno>
<idno type="wicri:Area/Main/Merge">002D62</idno>
<idno type="wicri:Area/Main/Curation">002C01</idno>
<idno type="wicri:Area/Main/Exploration">002C01</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">High-performance OCR preclassification trees</title>
<author><name sortKey="Baird, H S" sort="Baird, H S" uniqKey="Baird H" first="H. S." last="Baird">H. S. Baird</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">New Jersey</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Mallows, C L" sort="Mallows, C L" uniqKey="Mallows C" first="C. L." last="Mallows">C. L. Mallows</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">New Jersey</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint><date when="1995">1995</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Classification</term>
<term>Decision tree</term>
<term>Document</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Printed character</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
<term>Arbre décision</term>
<term>Classification</term>
<term>Document</term>
<term>Reconnaissance forme</term>
<term>Caractère imprimé</term>
<term>Reconnaissance caractère</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Classification</term>
<term>Document</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error: this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree: the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>New Jersey</li>
</region>
</list>
<tree><country name="États-Unis"><region name="New Jersey"><name sortKey="Baird, H S" sort="Baird, H S" uniqKey="Baird H" first="H. S." last="Baird">H. S. Baird</name>
</region>
<name sortKey="Mallows, C L" sort="Mallows, C L" uniqKey="Mallows C" first="C. L." last="Mallows">C. L. Mallows</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002C01 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002C01 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:97-0126106
   |texte=   High-performance OCR preclassification trees
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
