OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

High-performance OCR preclassification trees

Internal identifier: 000A42 (PascalFrancis/Curation); previous: 000A41; next: 000A43

Authors: H. S. Baird [United States]; C. L. Mallows [United States]

RBID : Pascal:97-0126106

Abstract

We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error: this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree: the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.
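
The abstract gives no implementation details, so the following is only a minimal sketch of what greedy entropy minimization can look like at a single tree node. It assumes numeric features and axis-aligned threshold tests; the names `entropy` and `best_split` and the toy samples are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def best_split(samples):
    """Greedy step: choose the threshold test with the lowest weighted child entropy.

    samples: list of (feature_vector, label) pairs (assumed numeric features).
    Returns (feature_index, threshold, weighted_entropy), or None when no
    feature takes two distinct values.
    """
    n = len(samples)
    n_features = len(samples[0][0])
    best = None
    for f in range(n_features):
        values = sorted({x[f] for x, _ in samples})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in samples if x[f] <= t]
            right = [y for x, y in samples if x[f] > t]
            h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or h < best[2]:
                best = (f, t, h)
    return best


# Toy data (hypothetical): feature 0 separates the two classes, feature 1 does not.
toy = [((0.1, 5.0), "a"), ((0.2, 1.0), "a"), ((0.8, 4.0), "b"), ((0.9, 2.0), "b")]
print(best_split(toy))
```

A full tree builder would apply `best_split` recursively to the two child sample sets. The population step and the leaf-selection rule described in the abstract are not covered by this sketch.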
pA  
A01 01  1    @0 1017-2653
A05       @2 2422
A08 01  1  ENG  @1 High-performance OCR preclassification trees
A09 01  1  ENG  @1 Document recognition II : San Jose CA, 6-7 February 1995
A11 01  1    @1 BAIRD (H. S.)
A11 02  1    @1 MALLOWS (C. L.)
A12 01  1    @1 VINCENT (Luc M.) @9 ed.
A12 02  1    @1 BAIRD (Henry S.) @9 ed.
A14 01      @1 AT&T Bell Laboratories, 600 Mountain Avenue, Room 2C-322 @2 Murray Hill, NJ 07974-0636 @3 USA @Z 1 aut. @Z 2 aut.
A18 01  1    @1 International Society for Optical Engineering @2 Bellingham WA @3 USA @9 patr.
A18 02  1    @1 Society for Imaging Science and Technology @2 Springfield VA @3 USA @9 patr.
A20       @1 47-53
A21       @1 1995
A23 01      @0 ENG
A43 01      @1 INIST @2 21760 @5 354000053416650050
A44       @0 0000 @1 © 1997 INIST-CNRS. All rights reserved.
A45       @0 9 ref.
A47 01  1    @0 97-0126106
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 SPIE proceedings series
A66 01      @0 USA
C01 01    ENG  @0 We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error : this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree : the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.
C02 01  X    @0 001A01G02A
C02 02  X    @0 205
C03 01  X  FRE  @0 Reconnaissance optique caractère @5 05
C03 01  X  ENG  @0 Optical character recognition @5 05
C03 01  X  SPA  @0 Reconocimiento óptico de caracteres @5 05
C03 02  X  FRE  @0 Arbre décision @5 06
C03 02  X  ENG  @0 Decision tree @5 06
C03 02  X  SPA  @0 Arbol decisión @5 06
C03 03  X  FRE  @0 Classification @5 07
C03 03  X  ENG  @0 Classification @5 07
C03 03  X  GER  @0 Klassifizierung @5 07
C03 03  X  SPA  @0 Clasificación @5 07
C03 04  X  FRE  @0 Document @5 12
C03 04  X  ENG  @0 Document @5 12
C03 04  X  SPA  @0 Documento @5 12
C03 05  X  FRE  @0 Reconnaissance forme @5 13
C03 05  X  ENG  @0 Pattern recognition @5 13
C03 05  X  GER  @0 Mustererkennung @5 13
C03 05  X  SPA  @0 Reconocimiento patrón @5 13
C03 06  X  FRE  @0 Caractère imprimé @5 14
C03 06  X  ENG  @0 Printed character @5 14
C03 06  X  SPA  @0 Carácter impreso @5 14
C03 07  X  FRE  @0 Reconnaissance caractère @5 15
C03 07  X  ENG  @0 Character recognition @5 15
C03 07  X  SPA  @0 Reconocimiento carácter @5 15
N21       @1 048
pR  
A30 01  1  ENG  @1 Document recognition. Conference @3 San Jose CA USA @4 1995-02-06
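
The flat listing above follows a regular pattern: a field tag, optional indicators, then subfields introduced by `@` plus a one-character code. As a rough sketch, assuming that layout holds for every line and that field values never contain a whitespace-preceded `@x ` sequence of their own, one line could be parsed as follows. The function name `parse_field` and the returned dictionary layout are hypothetical choices, not part of any INIST or Dilib tool.

```python
import re

# A subfield starts with "@", a one-character code, then whitespace.
SUBFIELD = re.compile(r"(?:^|\s)@(\w)\s+")


def parse_field(line):
    """Split one flat-record line into tag, indicators and (code, value) subfields."""
    parts = SUBFIELD.split(line.strip())
    head = parts[0].split()
    if not head:
        return None  # blank line
    # re.split with one capturing group yields: head, code, value, code, value, ...
    subfields = [(code, value.strip()) for code, value in zip(parts[1::2], parts[2::2])]
    return {"tag": head[0], "indicators": head[1:], "subfields": subfields}


field = parse_field("A14 01      @1 AT&T Bell Laboratories @3 USA @Z 1 aut. @Z 2 aut.")
print(field["tag"], field["indicators"])
for code, value in field["subfields"]:
    print(code, value)
```

Subfields are kept as an ordered list of pairs rather than a dictionary because a code such as `@Z` can repeat within one field.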

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">High-performance OCR preclassification trees</title>
<author>
<name sortKey="Baird, H S" sort="Baird, H S" uniqKey="Baird H" first="H. S." last="Baird">H. S. Baird</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Mallows, C L" sort="Mallows, C L" uniqKey="Mallows C" first="C. L." last="Mallows">C. L. Mallows</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">97-0126106</idno>
<date when="1995">1995</date>
<idno type="stanalyst">PASCAL 97-0126106 INIST</idno>
<idno type="RBID">Pascal:97-0126106</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000957</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000A42</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">High-performance OCR preclassification trees</title>
<author>
<name sortKey="Baird, H S" sort="Baird, H S" uniqKey="Baird H" first="H. S." last="Baird">H. S. Baird</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Mallows, C L" sort="Mallows, C L" uniqKey="Mallows C" first="C. L." last="Mallows">C. L. Mallows</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="1995">1995</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Classification</term>
<term>Decision tree</term>
<term>Document</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Printed character</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance optique caractère</term>
<term>Arbre décision</term>
<term>Classification</term>
<term>Document</term>
<term>Reconnaissance forme</term>
<term>Caractère imprimé</term>
<term>Reconnaissance caractère</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Classification</term>
<term>Document</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error : this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree : the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>1017-2653</s0>
</fA01>
<fA05>
<s2>2422</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>High-performance OCR preclassification trees</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Document recognition II : San Jose CA, 6-7 February 1995</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>BAIRD (H. S.)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>MALLOWS (C. L.)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>VINCENT (Luc M.)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>BAIRD (Henry S.)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1>
<s2>Murray Hill, NJ 07974-0636</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>International Society for Optical Engineering</s1>
<s2>Bellingham WA</s2>
<s3>USA</s3>
<s9>patr.</s9>
</fA18>
<fA18 i1="02" i2="1">
<s1>Society for Imaging Science and Technology</s1>
<s2>Springfield VA</s2>
<s3>USA</s3>
<s9>patr.</s9>
</fA18>
<fA20>
<s1>47-53</s1>
</fA20>
<fA21>
<s1>1995</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21760</s2>
<s5>354000053416650050</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 1997 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>9 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>97-0126106</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>SPIE proceedings series</s0>
</fA64>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>We present an automatic method for constructing high-performance preclassification decision trees for OCR. Good preclassifiers prune the set of alternative classes to many fewer without erroneously pruning the correct class. We build the decision tree using greedy entropy minimization, using pseudo-randomly generated training samples derived from a model of imaging defects, and then "populate" the tree with many more samples to drive down the error rate. In [BM94] we presented a statistically rigorous stopping rule for population that enforces a user-specified upper bound on error : this works in practice, but is too conservative, driving the error far below the bound. Here, we describe a refinement that achieves the user-specified accuracy more closely and thus improves the pruning rate of the resulting tree. The method exploits the structure of the tree : the essential technical device is a leaf-selection rule based on Good's Theorem [Good53]. We illustrate its effectiveness through experiments on a pan-European polyfont classifier.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001A01G02A</s0>
</fC02>
<fC02 i1="02" i2="X">
<s0>205</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>05</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>05</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Reconocimiento óptico de caracteres</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Arbre décision</s0>
<s5>06</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Decision tree</s0>
<s5>06</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Arbol decisión</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Classification</s0>
<s5>07</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Classification</s0>
<s5>07</s5>
</fC03>
<fC03 i1="03" i2="X" l="GER">
<s0>Klassifizierung</s0>
<s5>07</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Clasificación</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Document</s0>
<s5>12</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Document</s0>
<s5>12</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Documento</s0>
<s5>12</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Reconnaissance forme</s0>
<s5>13</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Pattern recognition</s0>
<s5>13</s5>
</fC03>
<fC03 i1="05" i2="X" l="GER">
<s0>Mustererkennung</s0>
<s5>13</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Reconocimiento patrón</s0>
<s5>13</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE">
<s0>Caractère imprimé</s0>
<s5>14</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG">
<s0>Printed character</s0>
<s5>14</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA">
<s0>Carácter impreso</s0>
<s5>14</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE">
<s0>Reconnaissance caractère</s0>
<s5>15</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Character recognition</s0>
<s5>15</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA">
<s0>Reconocimiento carácter</s0>
<s5>15</s5>
</fC03>
<fN21>
<s1>048</s1>
</fN21>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Document recognition. Conference</s1>
<s3>San Jose CA USA</s3>
<s4>1995-02-06</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
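
As an illustration of how the `<pA>` part of such a record might be read programmatically, here is a small sketch using Python's standard `xml.etree.ElementTree`. It works on an abbreviated fragment copied from the record rather than on the whole document, for two reasons that are assumptions about the parser and not statements about Dilib: a literal `&` in text must be written `&amp;` for a conforming XML parser, and the `wicri:` and `inist:` prefixes in the TEI part have no namespace declarations in this listing, so a namespace-aware parser would likely reject the full record unchanged.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment of the <pA> block, with the ampersand escaped.
FRAGMENT = """
<pA>
  <fA08 i1="01" i2="1" l="ENG"><s1>High-performance OCR preclassification trees</s1></fA08>
  <fA11 i1="01" i2="1"><s1>BAIRD (H. S.)</s1></fA11>
  <fA11 i1="02" i2="1"><s1>MALLOWS (C. L.)</s1></fA11>
  <fA14 i1="01"><s1>AT&amp;T Bell Laboratories, 600 Mountain Avenue, Room 2C-322</s1></fA14>
  <fC03 i1="01" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0><s5>05</s5></fC03>
  <fC03 i1="01" i2="X" l="ENG"><s0>Optical character recognition</s0><s5>05</s5></fC03>
  <fC03 i1="02" i2="X" l="ENG"><s0>Decision tree</s0><s5>06</s5></fC03>
</pA>
"""

root = ET.fromstring(FRAGMENT)

title = root.findtext("fA08/s1")
authors = [s1.text for s1 in root.findall("fA11/s1")]
# Keep only descriptors whose language attribute l is "ENG".
english_terms = [f.findtext("s0") for f in root.findall("fC03[@l='ENG']")]

print(title)
print(authors)
print(english_terms)
```

The attribute predicate `[@l='ENG']` is part of the limited XPath subset that `ElementTree` supports, which is enough to separate the multilingual `fC03` descriptors by language.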

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000A42 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000A42 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Curation
   |type=    RBID
   |clé=     Pascal:97-0126106
   |texte=   High-performance OCR preclassification trees
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024