OCR exploration server

Improved CHAID Algorithm for Document Structure Modelling

Internal identifier: 000164 (PascalFrancis/Corpus); previous: 000163; next: 000165

Authors: A. Belaïd; T. Moinel; Y. Rangoni

RBID : Pascal:10-0429692

Abstract

This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
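The abstract names a chi-squared decision-tree method but gives no detail on how a split is chosen. The following is a minimal sketch of the general idea behind CHAID-style splitting, not the I-CHAID variant implemented in SIPINA: for each candidate feature, compute the Pearson chi-squared statistic of its contingency table against the logical label, and split on the feature with the strongest association. The feature names (`font`, `position`), the label values, and the toy data are hypothetical. CHAID proper compares adjusted p-values rather than raw statistics; comparing raw statistics is only reasonable here because both toy features have the same number of categories.

```python
from collections import Counter

def chi2_statistic(rows, feature, label):
    """Pearson chi-squared statistic of the feature-by-label contingency table."""
    n = len(rows)
    joint = Counter((r[feature], r[label]) for r in rows)
    feature_totals = Counter(r[feature] for r in rows)
    label_totals = Counter(r[label] for r in rows)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n   # count expected under independence
            observed = joint.get((f, l), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def best_split(rows, features, label):
    """Pick the feature most strongly associated with the label (simplified)."""
    return max(features, key=lambda f: chi2_statistic(rows, f, label))

# Hypothetical text blocks with two physical features and one logical label.
blocks = [
    {"font": "large", "position": "top", "label": "title"},
    {"font": "large", "position": "top", "label": "title"},
    {"font": "small", "position": "top", "label": "author"},
    {"font": "small", "position": "bottom", "label": "author"},
    {"font": "small", "position": "bottom", "label": "body"},
    {"font": "small", "position": "bottom", "label": "body"},
]

print(best_split(blocks, ["font", "position"], "label"))
```

In this toy data the font size isolates the titles, so it scores higher than the position feature.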

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

pA  
A01 01  1    @0 0277-786X
A02 01      @0 PSISDG
A03   1    @0 Proc. SPIE Int. Soc. Opt. Eng.
A05       @2 7534
A08 01  1  ENG  @1 Improved CHAID Algorithm for Document Structure Modelling
A09 01  1  ENG  @1 Document recognition and retrieval XVII : 19-21 January 2010, San Jose, California, United States
A11 01  1    @1 BELAÏD (A.)
A11 02  1    @1 MOINEL (T.)
A11 03  1    @1 RANGONI (Y.)
A12 01  1    @1 LIKFORMAN-SULEM (Laurence) @9 ed.
A12 02  1    @1 AGAM (Gady) @9 ed.
A14 01      @1 LORIA-University Nancy 2, Campus Scientifique, B.P. 239 @2 Vandœuvre-Lès-Nancy @3 FRA @Z 1 aut. @Z 2 aut. @Z 3 aut.
A18 01  1    @1 SPIE @3 USA @9 org-cong.
A18 02  1    @1 IS&T @3 USA @9 org-cong.
A18 03  1    @1 Institut TELECOM @3 FRA @9 org-cong.
A20       @2 75340X.1-75340X.7
A21       @1 2010
A23 01      @0 ENG
A25 01      @1 SPIE @2 Bellingham WA
A26 01      @0 978-0-8194-7927-3
A43 01      @1 INIST @2 21760 @5 354000174683810320
A44       @0 0000 @1 © 2010 INIST-CNRS. All rights reserved.
A45       @0 16 ref.
A47 01  1    @0 10-0429692
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 Proceedings of SPIE, the International Society for Optical Engineering
A66 01      @0 USA
C01 01    ENG  @0 This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
C02 01  3    @0 001B00A30C
C02 02  3    @0 001B40B30S
C02 03  X    @0 001D04A05A
C02 04  X    @0 001D04A05C
C03 01  3  FRE  @0 Algorithme @5 23
C03 01  3  ENG  @0 Algorithms @5 23
C03 02  3  FRE  @0 Reconnaissance forme @5 61
C03 02  3  ENG  @0 Pattern recognition @5 61
C03 03  X  FRE  @0 Recherche documentaire @5 62
C03 03  X  ENG  @0 Document retrieval @5 62
C03 03  X  SPA  @0 Búsqueda documental @5 62
C03 04  X  FRE  @0 Structure document @5 63
C03 04  X  ENG  @0 Document structure @5 63
C03 04  X  SPA  @0 Estructura documental @5 63
C03 05  3  FRE  @0 Modélisation @5 64
C03 05  3  ENG  @0 Modelling @5 64
C03 06  3  FRE  @0 Etiquetage @5 65
C03 06  3  ENG  @0 Labelling @5 65
C03 07  3  FRE  @0 Traitement image document @5 66
C03 07  3  ENG  @0 Document image processing @5 66
C03 08  3  FRE  @0 Arbre décision @5 67
C03 08  3  ENG  @0 Decision trees @5 67
C03 09  X  FRE  @0 Etat actuel @5 68
C03 09  X  ENG  @0 State of the art @5 68
C03 09  X  SPA  @0 Estado actual @5 68
C03 10  3  FRE  @0 Reconnaissance optique caractère @5 69
C03 10  3  ENG  @0 Optical character recognition @5 69
C03 11  3  FRE  @0 0130C @4 INC @5 83
C03 12  3  FRE  @0 4230S @4 INC @5 91
C03 13  3  FRE  @0 4230V @4 INC @5 92
N21       @1 277
N44 01      @1 OTO
N82       @1 OTO
pR  
A30 01  1  ENG  @1 Document recognition and retrieval @2 17 @3 San Jose CA USA @4 2010
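Each line of the listing above appears to consist of a field tag, optional indicator columns, and one or more subfields introduced by `@` followed by a one-character code. The sketch below splits one such line into those parts. It is based only on the layout visible in this record, not on the Inist format documentation, so the interpretation of the middle columns as "indicators" is an assumption, and `parse_field` is a hypothetical helper name.

```python
import re

# A subfield is "@" + one code character, then text up to the next subfield or end of line.
SUBFIELD = re.compile(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)")

def parse_field(line):
    """Split one listing line into tag, indicator columns, and (code, value) subfields."""
    head = line.split("@", 1)[0].split()
    if not head:
        return None  # blank line
    return {
        "tag": head[0],
        "indicators": head[1:],
        # A list of pairs, because a code such as Z can repeat within one field.
        "subfields": SUBFIELD.findall(line),
    }

field = parse_field("A43 01      @1 INIST @2 21760 @5 354000174683810320")
print(field["tag"], field["subfields"])
```

Keeping subfields as an ordered list of pairs, rather than a dictionary, preserves repeated codes such as the three `@Z` entries in field A14.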

Inist format (server)

NO : PASCAL 10-0429692 INIST
ET : Improved CHAID Algorithm for Document Structure Modelling
AU : BELAÏD (A.); MOINEL (T.); RANGONI (Y.); LIKFORMAN-SULEM (Laurence); AGAM (Gady)
AF : LORIA-University Nancy 2, Campus Scientifique, B.P. 239/Vandœuvre-Lès-Nancy/France (1 aut., 2 aut., 3 aut.)
DT : Publication en série; Congrès; Niveau analytique
SO : Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; Etats-Unis; Da. 2010; Vol. 7534; 75340X.1-75340X.7; Bibl. 16 ref.
LA : Anglais
EA : This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
CC : 001B00A30C; 001B40B30S; 001D04A05A; 001D04A05C
FD : Algorithme; Reconnaissance forme; Recherche documentaire; Structure document; Modélisation; Etiquetage; Traitement image document; Arbre décision; Etat actuel; Reconnaissance optique caractère; 0130C; 4230S; 4230V
ED : Algorithms; Pattern recognition; Document retrieval; Document structure; Modelling; Labelling; Document image processing; Decision trees; State of the art; Optical character recognition
SD : Búsqueda documental; Estructura documental; Estado actual
LO : INIST-21760.354000174683810320
ID : 10-0429692
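The server format above is a flat list of two-letter tags, each followed by ` : ` and a value, with multi-valued fields separated by `; `. A minimal sketch of reading it into a dictionary follows. The function name `parse_server_record` and the choice of which tags to treat as lists are assumptions made for illustration; the SO field is deliberately left as a single string because it mixes several kinds of information.

```python
LIST_TAGS = {"AU", "CC", "FD", "ED", "SD"}  # assumed multi-valued, judging by this record

def parse_server_record(text):
    """Turn 'XX : value' lines into a dict, splitting the list-like fields on ';'."""
    record = {}
    for line in text.splitlines():
        tag, sep, value = line.partition(" : ")
        tag = tag.strip()
        if not sep or len(tag) != 2:
            continue  # not a 'tag : value' line
        value = value.strip()
        if tag in LIST_TAGS:
            record[tag] = [item.strip() for item in value.split(";") if item.strip()]
        else:
            record[tag] = value
    return record

sample = """NO : PASCAL 10-0429692 INIST
ET : Improved CHAID Algorithm for Document Structure Modelling
AU : BELAÏD (A.); MOINEL (T.); RANGONI (Y.); LIKFORMAN-SULEM (Laurence); AGAM (Gady)
CC : 001B00A30C; 001B40B30S; 001D04A05A; 001D04A05C
ID : 10-0429692"""

record = parse_server_record(sample)
print(record["ET"], len(record["AU"]))
```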

Links to Exploration step

Pascal:10-0429692

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation>
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<affiliation>
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
<affiliation>
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">10-0429692</idno>
<date when="2010">2010</date>
<idno type="stanalyst">PASCAL 10-0429692 INIST</idno>
<idno type="RBID">Pascal:10-0429692</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000164</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation>
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<affiliation>
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
<affiliation>
<inist:fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Decision trees</term>
<term>Document image processing</term>
<term>Document retrieval</term>
<term>Document structure</term>
<term>Labelling</term>
<term>Modelling</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>State of the art</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Algorithme</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Structure document</term>
<term>Modélisation</term>
<term>Etiquetage</term>
<term>Traitement image document</term>
<term>Arbre décision</term>
<term>Etat actuel</term>
<term>Reconnaissance optique caractère</term>
<term>0130C</term>
<term>4230S</term>
<term>4230V</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0277-786X</s0>
</fA01>
<fA02 i1="01">
<s0>PSISDG</s0>
</fA02>
<fA03 i2="1">
<s0>Proc. SPIE Int. Soc. Opt. Eng.</s0>
</fA03>
<fA05>
<s2>7534</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>Improved CHAID Algorithm for Document Structure Modelling</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Document recognition and retrieval XVII : 19-21 January 2010, San Jose, California, United States</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>BELAÏD (A.)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>MOINEL (T.)</s1>
</fA11>
<fA11 i1="03" i2="1">
<s1>RANGONI (Y.)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>LIKFORMAN-SULEM (Laurence)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>AGAM (Gady)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>SPIE</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="02" i2="1">
<s1>IS&amp;T</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="03" i2="1">
<s1>Institut TELECOM</s1>
<s3>FRA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20>
<s2>75340X.1-75340X.7</s2>
</fA20>
<fA21>
<s1>2010</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA25 i1="01">
<s1>SPIE</s1>
<s2>Bellingham WA</s2>
</fA25>
<fA26 i1="01">
<s0>978-0-8194-7927-3</s0>
</fA26>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21760</s2>
<s5>354000174683810320</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2010 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>16 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>10-0429692</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Proceedings of SPIE, the International Society for Optical Engineering</s0>
</fA64>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</s0>
</fC01>
<fC02 i1="01" i2="3">
<s0>001B00A30C</s0>
</fC02>
<fC02 i1="02" i2="3">
<s0>001B40B30S</s0>
</fC02>
<fC02 i1="03" i2="X">
<s0>001D04A05A</s0>
</fC02>
<fC02 i1="04" i2="X">
<s0>001D04A05C</s0>
</fC02>
<fC03 i1="01" i2="3" l="FRE">
<s0>Algorithme</s0>
<s5>23</s5>
</fC03>
<fC03 i1="01" i2="3" l="ENG">
<s0>Algorithms</s0>
<s5>23</s5>
</fC03>
<fC03 i1="02" i2="3" l="FRE">
<s0>Reconnaissance forme</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="3" l="ENG">
<s0>Pattern recognition</s0>
<s5>61</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Recherche documentaire</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Document retrieval</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Búsqueda documental</s0>
<s5>62</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Structure document</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Document structure</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Estructura documental</s0>
<s5>63</s5>
</fC03>
<fC03 i1="05" i2="3" l="FRE">
<s0>Modélisation</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="3" l="ENG">
<s0>Modelling</s0>
<s5>64</s5>
</fC03>
<fC03 i1="06" i2="3" l="FRE">
<s0>Etiquetage</s0>
<s5>65</s5>
</fC03>
<fC03 i1="06" i2="3" l="ENG">
<s0>Labelling</s0>
<s5>65</s5>
</fC03>
<fC03 i1="07" i2="3" l="FRE">
<s0>Traitement image document</s0>
<s5>66</s5>
</fC03>
<fC03 i1="07" i2="3" l="ENG">
<s0>Document image processing</s0>
<s5>66</s5>
</fC03>
<fC03 i1="08" i2="3" l="FRE">
<s0>Arbre décision</s0>
<s5>67</s5>
</fC03>
<fC03 i1="08" i2="3" l="ENG">
<s0>Decision trees</s0>
<s5>67</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE">
<s0>Etat actuel</s0>
<s5>68</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG">
<s0>State of the art</s0>
<s5>68</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA">
<s0>Estado actual</s0>
<s5>68</s5>
</fC03>
<fC03 i1="10" i2="3" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>69</s5>
</fC03>
<fC03 i1="10" i2="3" l="ENG">
<s0>Optical character recognition</s0>
<s5>69</s5>
</fC03>
<fC03 i1="11" i2="3" l="FRE">
<s0>0130C</s0>
<s4>INC</s4>
<s5>83</s5>
</fC03>
<fC03 i1="12" i2="3" l="FRE">
<s0>4230S</s0>
<s4>INC</s4>
<s5>91</s5>
</fC03>
<fC03 i1="13" i2="3" l="FRE">
<s0>4230V</s0>
<s4>INC</s4>
<s5>92</s5>
</fC03>
<fN21>
<s1>277</s1>
</fN21>
<fN44 i1="01">
<s1>OTO</s1>
</fN44>
<fN82>
<s1>OTO</s1>
</fN82>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Document recognition and retrieval</s1>
<s2>17</s2>
<s3>San Jose CA USA</s3>
<s4>2010</s4>
</fA30>
</pR>
</standard>
<server>
<NO>PASCAL 10-0429692 INIST</NO>
<ET>Improved CHAID Algorithm for Document Structure Modelling</ET>
<AU>BELAÏD (A.); MOINEL (T.); RANGONI (Y.); LIKFORMAN-SULEM (Laurence); AGAM (Gady)</AU>
<AF>LORIA-University Nancy 2, Campus Scientifique, B.P. 239/Vandœuvre-Lès-Nancy/France (1 aut., 2 aut., 3 aut.)</AF>
<DT>Publication en série; Congrès; Niveau analytique</DT>
<SO>Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; Etats-Unis; Da. 2010; Vol. 7534; 75340X.1-75340X.7; Bibl. 16 ref.</SO>
<LA>Anglais</LA>
<EA>This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</EA>
<CC>001B00A30C; 001B40B30S; 001D04A05A; 001D04A05C</CC>
<FD>Algorithme; Reconnaissance forme; Recherche documentaire; Structure document; Modélisation; Etiquetage; Traitement image document; Arbre décision; Etat actuel; Reconnaissance optique caractère; 0130C; 4230S; 4230V</FD>
<ED>Algorithms; Pattern recognition; Document retrieval; Document structure; Modelling; Labelling; Document image processing; Decision trees; State of the art; Optical character recognition</ED>
<SD>Búsqueda documental; Estructura documental; Estado actual</SD>
<LO>INIST-21760.354000174683810320</LO>
<ID>10-0429692</ID>
</server>
</inist>
</record>
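As an illustration of reading this record programmatically, the sketch below parses a reduced copy of the `<server>` element with Python's standard `xml.etree.ElementTree`. It works on that fragment only: the full record uses an `inist:` prefix with no namespace declaration shown on this page, so a strict XML parser may reject the document as a whole. The helper name `read_server` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the <server> element from the record above.
SERVER_XML = """<server>
<NO>PASCAL 10-0429692 INIST</NO>
<ET>Improved CHAID Algorithm for Document Structure Modelling</ET>
<AU>BELAÏD (A.); MOINEL (T.); RANGONI (Y.); LIKFORMAN-SULEM (Laurence); AGAM (Gady)</AU>
<ID>10-0429692</ID>
</server>"""

def read_server(xml_text):
    """Map each child tag of <server> to its text; split AU into a list of names."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    names = fields.get("AU", "")
    fields["AU"] = [name.strip() for name in names.split(";") if name.strip()]
    return fields

fields = read_server(SERVER_XML)
print(fields["ET"])
print(fields["AU"])
```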

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000164 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000164 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:10-0429692
   |texte=   Improved CHAID Algorithm for Document Structure Modelling
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024