Improved CHAID Algorithm for Document Structure Modelling
Internal identifier: 000164 (PascalFrancis/Corpus); previous: 000163; next: 000165
Authors: A. Belaïd; T. Moinel; Y. Rangoni
Source: Proceedings of SPIE, the International Society for Optical Engineering [0277-786X]; 2010.
RBID : Pascal:10-0429692
French descriptors
- Pascal (Inist)
- Algorithme,
Reconnaissance forme,
Recherche documentaire,
Structure document,
Modélisation,
Etiquetage,
Traitement image document,
Arbre décision,
Etat actuel,
Reconnaissance optique caractère,
0130C,
4230S,
4230V.
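English descriptors
- Algorithms,
Decision trees,
Document image processing,
Document retrieval,
Document structure,
Labelling,
Modelling,
Optical character recognition,
Pattern recognition,
State of the art.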
Abstract
This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.
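The abstract does not spell out how I-CHAID differs from plain CHAID, so the following is only a minimal sketch of the Pearson chi-squared statistic that CHAID-family methods use to rate how strongly a candidate feature is associated with the class label. It is not the paper's procedure or SIPINA's implementation; the function name, the feature names (`font_size`, `alignment`) and the toy blocks are hypothetical.

```python
from collections import Counter

def chi_squared(feature_values, labels):
    """Return (Pearson chi-squared statistic, degrees of freedom)
    for the contingency table of one categorical feature vs. the label."""
    n = len(labels)
    row_totals = Counter(feature_values)            # counts per feature category
    col_totals = Counter(labels)                    # counts per logical label
    observed = Counter(zip(feature_values, labels))  # joint counts
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n        # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical toy data: six text blocks with two physical features each.
labels    = ["title", "title", "body", "body", "body", "body"]
font_size = ["large", "large", "small", "small", "small", "small"]
alignment = ["center", "left", "center", "left", "center", "left"]

for name, values in [("font_size", font_size), ("alignment", alignment)]:
    stat, dof = chi_squared(values, labels)
    print(f"{name}: chi2={stat:.3f}, dof={dof}")
```

In this toy data both features have the same degrees of freedom, so the one with the larger statistic (`font_size`) would be the preferred split. With differing numbers of categories, CHAID-style methods compare p-values rather than raw statistics.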
Record in standard format (ISO 2709)
See the documentation on the Inist Standard format.
pA |
A01 | 01 | 1 | | @0 0277-786X |
---|
A02 | 01 | | | @0 PSISDG |
---|
A03 | | 1 | | @0 Proc. SPIE Int. Soc. Opt. Eng. |
---|
A05 | | | | @2 7534 |
---|
A08 | 01 | 1 | ENG | @1 Improved CHAID Algorithm for Document Structure Modelling |
---|
A09 | 01 | 1 | ENG | @1 Document recognition and retrieval XVII : 19-21 January 2010, San Jose, California, United States |
---|
A11 | 01 | 1 | | @1 BELAÏD (A.) |
---|
A11 | 02 | 1 | | @1 MOINEL (T.) |
---|
A11 | 03 | 1 | | @1 RANGONI (Y.) |
---|
A12 | 01 | 1 | | @1 LIKFORMAN-SULEM (Laurence) @9 ed. |
---|
A12 | 02 | 1 | | @1 AGAM (Gady) @9 ed. |
---|
A14 | 01 | | | @1 LORIA-University Nancy 2, Campus Scientifique, B.P. 239 @2 Vandœuvre-Lès-Nancy @3 FRA @Z 1 aut. @Z 2 aut. @Z 3 aut. |
---|
A18 | 01 | 1 | | @1 SPIE @3 USA @9 org-cong. |
---|
A18 | 02 | 1 | | @1 IS&T @3 USA @9 org-cong. |
---|
A18 | 03 | 1 | | @1 Institut TELECOM @3 FRA @9 org-cong. |
---|
A20 | | | | @2 75340X.1-75340X.7 |
---|
A21 | | | | @1 2010 |
---|
A23 | 01 | | | @0 ENG |
---|
A25 | 01 | | | @1 SPIE @2 Bellingham WA |
---|
A26 | 01 | | | @0 978-0-8194-7927-3 |
---|
A43 | 01 | | | @1 INIST @2 21760 @5 354000174683810320 |
---|
A44 | | | | @0 0000 @1 © 2010 INIST-CNRS. All rights reserved. |
---|
A45 | | | | @0 16 ref. |
---|
A47 | 01 | 1 | | @0 10-0429692 |
---|
A60 | | | | @1 P @2 C |
---|
A61 | | | | @0 A |
---|
A64 | 01 | 1 | | @0 Proceedings of SPIE, the International Society for Optical Engineering |
---|
A66 | 01 | | | @0 USA |
---|
C01 | 01 | | ENG | @0 This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%. |
---|
C02 | 01 | 3 | | @0 001B00A30C |
---|
C02 | 02 | 3 | | @0 001B40B30S |
---|
C02 | 03 | X | | @0 001D04A05A |
---|
C02 | 04 | X | | @0 001D04A05C |
---|
C03 | 01 | 3 | FRE | @0 Algorithme @5 23 |
---|
C03 | 01 | 3 | ENG | @0 Algorithms @5 23 |
---|
C03 | 02 | 3 | FRE | @0 Reconnaissance forme @5 61 |
---|
C03 | 02 | 3 | ENG | @0 Pattern recognition @5 61 |
---|
C03 | 03 | X | FRE | @0 Recherche documentaire @5 62 |
---|
C03 | 03 | X | ENG | @0 Document retrieval @5 62 |
---|
C03 | 03 | X | SPA | @0 Búsqueda documental @5 62 |
---|
C03 | 04 | X | FRE | @0 Structure document @5 63 |
---|
C03 | 04 | X | ENG | @0 Document structure @5 63 |
---|
C03 | 04 | X | SPA | @0 Estructura documental @5 63 |
---|
C03 | 05 | 3 | FRE | @0 Modélisation @5 64 |
---|
C03 | 05 | 3 | ENG | @0 Modelling @5 64 |
---|
C03 | 06 | 3 | FRE | @0 Etiquetage @5 65 |
---|
C03 | 06 | 3 | ENG | @0 Labelling @5 65 |
---|
C03 | 07 | 3 | FRE | @0 Traitement image document @5 66 |
---|
C03 | 07 | 3 | ENG | @0 Document image processing @5 66 |
---|
C03 | 08 | 3 | FRE | @0 Arbre décision @5 67 |
---|
C03 | 08 | 3 | ENG | @0 Decision trees @5 67 |
---|
C03 | 09 | X | FRE | @0 Etat actuel @5 68 |
---|
C03 | 09 | X | ENG | @0 State of the art @5 68 |
---|
C03 | 09 | X | SPA | @0 Estado actual @5 68 |
---|
C03 | 10 | 3 | FRE | @0 Reconnaissance optique caractère @5 69 |
---|
C03 | 10 | 3 | ENG | @0 Optical character recognition @5 69 |
---|
C03 | 11 | 3 | FRE | @0 0130C @4 INC @5 83 |
---|
C03 | 12 | 3 | FRE | @0 4230S @4 INC @5 91 |
---|
C03 | 13 | 3 | FRE | @0 4230V @4 INC @5 92 |
---|
N21 | | | | @1 277 |
---|
N44 | 01 | | | @1 OTO |
---|
N82 | | | | @1 OTO |
---|
|
pR |
A30 | 01 | 1 | ENG | @1 Document recognition and retrieval @2 17 @3 San Jose CA USA @4 2010 |
---|
|
Inist format (server)
NO : | PASCAL 10-0429692 INIST |
ET : | Improved CHAID Algorithm for Document Structure Modelling |
AU : | BELAÏD (A.); MOINEL (T.); RANGONI (Y.); LIKFORMAN-SULEM (Laurence); AGAM (Gady) |
AF : | LORIA-University Nancy 2, Campus Scientifique, B.P. 239/Vandœuvre-Lès-Nancy/France (1 aut., 2 aut., 3 aut.) |
DT : | Publication en série; Congrès; Niveau analytique |
SO : | Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; Etats-Unis; Da. 2010; Vol. 7534; 75340X.1-75340X.7; Bibl. 16 ref. |
LA : | Anglais |
EA : | This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%. |
CC : | 001B00A30C; 001B40B30S; 001D04A05A; 001D04A05C |
FD : | Algorithme; Reconnaissance forme; Recherche documentaire; Structure document; Modélisation; Etiquetage; Traitement image document; Arbre décision; Etat actuel; Reconnaissance optique caractère; 0130C; 4230S; 4230V |
ED : | Algorithms; Pattern recognition; Document retrieval; Document structure; Modelling; Labelling; Document image processing; Decision trees; State of the art; Optical character recognition |
SD : | Búsqueda documental; Estructura documental; Estado actual |
LO : | INIST-21760.354000174683810320 |
ID : | 10-0429692 |
Links to Exploration step
Pascal:10-0429692
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">10-0429692</idno>
<date when="2010">2010</date>
<idno type="stanalyst">PASCAL 10-0429692 INIST</idno>
<idno type="RBID">Pascal:10-0429692</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000164</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Improved CHAID Algorithm for Document Structure Modelling</title>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Moinel, T" sort="Moinel, T" uniqKey="Moinel T" first="T." last="Moinel">T. Moinel</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Rangoni, Y" sort="Rangoni, Y" uniqKey="Rangoni Y" first="Y." last="Rangoni">Y. Rangoni</name>
<affiliation><inist:fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint><date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Algorithms</term>
<term>Decision trees</term>
<term>Document image processing</term>
<term>Document retrieval</term>
<term>Document structure</term>
<term>Labelling</term>
<term>Modelling</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>State of the art</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Algorithme</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Structure document</term>
<term>Modélisation</term>
<term>Etiquetage</term>
<term>Traitement image document</term>
<term>Arbre décision</term>
<term>Etat actuel</term>
<term>Reconnaissance optique caractère</term>
<term>0130C</term>
<term>4230S</term>
<term>4230V</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>0277-786X</s0>
</fA01>
<fA02 i1="01"><s0>PSISDG</s0>
</fA02>
<fA03 i2="1"><s0>Proc. SPIE Int. Soc. Opt. Eng.</s0>
</fA03>
<fA05><s2>7534</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG"><s1>Improved CHAID Algorithm for Document Structure Modelling</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>Document recognition and retrieval XVII : 19-21 January 2010, San Jose, California, United States</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>BELAÏD (A.)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>MOINEL (T.)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>RANGONI (Y.)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>LIKFORMAN-SULEM (Laurence)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>AGAM (Gady)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>LORIA-University Nancy 2, Campus Scientifique, B.P. 239</s1>
<s2>Vandœuvre-Lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1"><s1>SPIE</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="02" i2="1"><s1>IS&amp;T</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="03" i2="1"><s1>Institut TELECOM</s1>
<s3>FRA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20><s2>75340X.1-75340X.7</s2>
</fA20>
<fA21><s1>2010</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA25 i1="01"><s1>SPIE</s1>
<s2>Bellingham WA</s2>
</fA25>
<fA26 i1="01"><s0>978-0-8194-7927-3</s0>
</fA26>
<fA43 i1="01"><s1>INIST</s1>
<s2>21760</s2>
<s5>354000174683810320</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2010 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>16 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>10-0429692</s0>
</fA47>
<fA60><s1>P</s1>
<s2>C</s2>
</fA60>
<fA64 i1="01" i2="1"><s0>Proceedings of SPIE, the International Society for Optical Engineering</s0>
</fA64>
<fA66 i1="01"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</s0>
</fC01>
<fC02 i1="01" i2="3"><s0>001B00A30C</s0>
</fC02>
<fC02 i1="02" i2="3"><s0>001B40B30S</s0>
</fC02>
<fC02 i1="03" i2="X"><s0>001D04A05A</s0>
</fC02>
<fC02 i1="04" i2="X"><s0>001D04A05C</s0>
</fC02>
<fC03 i1="01" i2="3" l="FRE"><s0>Algorithme</s0>
<s5>23</s5>
</fC03>
<fC03 i1="01" i2="3" l="ENG"><s0>Algorithms</s0>
<s5>23</s5>
</fC03>
<fC03 i1="02" i2="3" l="FRE"><s0>Reconnaissance forme</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="3" l="ENG"><s0>Pattern recognition</s0>
<s5>61</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Recherche documentaire</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Document retrieval</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Búsqueda documental</s0>
<s5>62</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Structure document</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Document structure</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Estructura documental</s0>
<s5>63</s5>
</fC03>
<fC03 i1="05" i2="3" l="FRE"><s0>Modélisation</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="3" l="ENG"><s0>Modelling</s0>
<s5>64</s5>
</fC03>
<fC03 i1="06" i2="3" l="FRE"><s0>Etiquetage</s0>
<s5>65</s5>
</fC03>
<fC03 i1="06" i2="3" l="ENG"><s0>Labelling</s0>
<s5>65</s5>
</fC03>
<fC03 i1="07" i2="3" l="FRE"><s0>Traitement image document</s0>
<s5>66</s5>
</fC03>
<fC03 i1="07" i2="3" l="ENG"><s0>Document image processing</s0>
<s5>66</s5>
</fC03>
<fC03 i1="08" i2="3" l="FRE"><s0>Arbre décision</s0>
<s5>67</s5>
</fC03>
<fC03 i1="08" i2="3" l="ENG"><s0>Decision trees</s0>
<s5>67</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE"><s0>Etat actuel</s0>
<s5>68</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG"><s0>State of the art</s0>
<s5>68</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA"><s0>Estado actual</s0>
<s5>68</s5>
</fC03>
<fC03 i1="10" i2="3" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>69</s5>
</fC03>
<fC03 i1="10" i2="3" l="ENG"><s0>Optical character recognition</s0>
<s5>69</s5>
</fC03>
<fC03 i1="11" i2="3" l="FRE"><s0>0130C</s0>
<s4>INC</s4>
<s5>83</s5>
</fC03>
<fC03 i1="12" i2="3" l="FRE"><s0>4230S</s0>
<s4>INC</s4>
<s5>91</s5>
</fC03>
<fC03 i1="13" i2="3" l="FRE"><s0>4230V</s0>
<s4>INC</s4>
<s5>92</s5>
</fC03>
<fN21><s1>277</s1>
</fN21>
<fN44 i1="01"><s1>OTO</s1>
</fN44>
<fN82><s1>OTO</s1>
</fN82>
</pA>
<pR><fA30 i1="01" i2="1" l="ENG"><s1>Document recognition and retrieval</s1>
<s2>17</s2>
<s3>San Jose CA USA</s3>
<s4>2010</s4>
</fA30>
</pR>
</standard>
<server><NO>PASCAL 10-0429692 INIST</NO>
<ET>Improved CHAID Algorithm for Document Structure Modelling</ET>
<AU>BELAÏD (A.); MOINEL (T.); RANGONI (Y.); LIKFORMAN-SULEM (Laurence); AGAM (Gady)</AU>
<AF>LORIA-University Nancy 2, Campus Scientifique, B.P. 239/Vandœuvre-Lès-Nancy/France (1 aut., 2 aut., 3 aut.)</AF>
<DT>Publication en série; Congrès; Niveau analytique</DT>
<SO>Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; Etats-Unis; Da. 2010; Vol. 7534; 75340X.1-75340X.7; Bibl. 16 ref.</SO>
<LA>Anglais</LA>
<EA>This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-mining method employed here is the "Improved CHi-squared Automatic Interaction Detection" (I-CHAID). The contribution of this work is the insertion of logical rules extracted from the logical layout knowledge to support the decision tree. Two setups have been tested; the first uses one tree per logical element, the second one uses a single tree for all the logical elements we want to recognise. The main system, implemented in Java, coordinates the third-party tools (Omnipage for the OCR part, and SIPINA for the I-CHAID algorithm) using XML and XSL transforms. It was tested on around 1000 documents belonging to the ICPR'04 and ICPR'08 conference proceedings, representing about 16,000 blocks. The final error rate for determining the logical labels (among 9 different ones) is less than 6%.</EA>
<CC>001B00A30C; 001B40B30S; 001D04A05A; 001D04A05C</CC>
<FD>Algorithme; Reconnaissance forme; Recherche documentaire; Structure document; Modélisation; Etiquetage; Traitement image document; Arbre décision; Etat actuel; Reconnaissance optique caractère; 0130C; 4230S; 4230V</FD>
<ED>Algorithms; Pattern recognition; Document retrieval; Document structure; Modelling; Labelling; Document image processing; Decision trees; State of the art; Optical character recognition</ED>
<SD>Búsqueda documental; Estructura documental; Estado actual</SD>
<LO>INIST-21760.354000174683810320</LO>
<ID>10-0429692</ID>
</server>
</inist>
</record>
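Each `fC03` field in the record above holds one descriptor (`s0`) tagged with a language attribute (`l`). As a minimal sketch, the descriptors can be grouped by language with Python's standard `xml.etree.ElementTree`. The function name is hypothetical, and the fragment below is a hand-copied, well-formed subset of the record, since the record as displayed uses an `inist:` prefix without a visible namespace declaration and may not parse as-is.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Small well-formed subset of the record's fC03 fields (copied from above).
FRAGMENT = """
<pA>
<fC03 i1="08" i2="3" l="FRE"><s0>Arbre décision</s0><s5>67</s5></fC03>
<fC03 i1="08" i2="3" l="ENG"><s0>Decision trees</s0><s5>67</s5></fC03>
<fC03 i1="09" i2="X" l="FRE"><s0>Etat actuel</s0><s5>68</s5></fC03>
<fC03 i1="09" i2="X" l="ENG"><s0>State of the art</s0><s5>68</s5></fC03>
<fC03 i1="09" i2="X" l="SPA"><s0>Estado actual</s0><s5>68</s5></fC03>
</pA>
"""

def descriptors_by_language(xml_text):
    """Map each language code (attribute 'l') to its list of descriptors."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for field in root.iter("fC03"):
        grouped[field.get("l")].append(field.findtext("s0"))
    return dict(grouped)

print(descriptors_by_language(FRAGMENT))
```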
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000164 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000164 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien
|wiki= Ticri/CIDE
|area= OcrV1
|flux= PascalFrancis
|étape= Corpus
|type= RBID
|clé= Pascal:10-0429692
|texte= Improved CHAID Algorithm for Document Structure Modelling
}}
This area was generated with Dilib version V0.6.32. Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024.