OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Unsupervised Method to Generate Page Templates

Internal identifier: 000133 (PascalFrancis/Corpus); previous: 000132; next: 000134

Author: Herve Dejean

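Source: Proceedings of SPIE, the International Society for Optical Engineering (ISSN 0277-786X); 2011; Vol. 7874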

RBID : Pascal:11-0279160

French descriptors

English descriptors

Abstract

In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
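
The abstract names the stages but gives no algorithmic detail, so the following is only a toy sketch of the general idea, not the paper's method. It assumes each page is a mapping from a logical label to the (x, y) position of that element, derives coarse geometric relations between labeled elements, keeps the relations that occur on at least a given fraction of pages as a candidate template, and scores a page against the template by the share of template relations it satisfies. The function names, the relation vocabulary, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def relations(page):
    """Return coarse geometric relations between labeled elements of one page.

    `page` maps a label to the (x, y) of its top-left corner; y grows downward.
    """
    rels = set()
    for a, b in combinations(sorted(page), 2):
        (ax, ay), (bx, by) = page[a], page[b]
        vertical = "above" if ay < by else "below" if ay > by else "level"
        horizontal = "left_of" if ax < bx else "right_of" if ax > bx else "aligned"
        rels.add((a, vertical, b))
        rels.add((a, horizontal, b))
    return frozenset(rels)


def candidate_template(pages, min_support=0.5):
    """Keep the relations seen on at least `min_support` of the pages."""
    counts = Counter(rel for page in pages for rel in relations(page))
    threshold = min_support * len(pages)
    return frozenset(rel for rel, n in counts.items() if n >= threshold)


def match_score(page, template):
    """Fuzzy match: fraction of template relations that the page satisfies."""
    if not template:
        return 0.0
    return len(relations(page) & template) / len(template)


# Three toy pages; the third one lacks a page number.
pages = [
    {"header": (10, 5), "body": (10, 40), "page_number": (90, 5)},
    {"header": (10, 5), "body": (10, 40), "page_number": (90, 5)},
    {"header": (10, 5), "body": (10, 40)},
]
template = candidate_template(pages)
scores = [match_score(page, template) for page in pages]
```

A page with a low score against a frequent template is a candidate for a zoning, OCR, or labeling error, which is the kind of correction the abstract mentions.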

Record in standard format (ISO 2709)

pA  
A01 01  1    @0 0277-786X
A02 01      @0 PSISDG
A03   1    @0 Proc. SPIE Int. Soc. Opt. Eng.
A05       @2 7874
A08 01  1  ENG  @1 Unsupervised Method to Generate Page Templates
A09 01  1  ENG  @1 Document recognition and retrieval XVIII : 26-27 January 2011, San Francisco, California, United States
A11 01  1    @1 DEJEAN (Herve)
A12 01  1    @1 AGAM (Gady) @9 ed.
A12 02  1    @1 VIARD-GAUDIN (Christian) @9 ed.
A14 01      @1 Xerox Research Centre @3 EUR @Z 1 aut.
A18 01  1    @1 SPIE @3 USA @9 org-cong.
A20       @2 78740M.1-78740M.10
A21       @1 2011
A23 01      @0 ENG
A25 01      @1 SPIE @2 Bellingham WA
A26 01      @0 978-0-8194-8411-6
A43 01      @1 INIST @2 21760 @5 354000174732580200
A44       @0 0000 @1 © 2011 INIST-CNRS. All rights reserved.
A45       @0 12 ref.
A47 01  1    @0 11-0279160
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 Proceedings of SPIE, the International Society for Optical Engineering
A66 01      @0 USA
C01 01    ENG  @0 In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
C02 01  3    @0 001B00A30C
C02 02  3    @0 001B40B30
C03 01  X  FRE  @0 Imagerie @5 19
C03 01  X  ENG  @0 Imagery @5 19
C03 01  X  SPA  @0 Imaginería @5 19
C03 02  X  FRE  @0 Classification non supervisée @5 61
C03 02  X  ENG  @0 Unsupervised classification @5 61
C03 02  X  SPA  @0 Clasificación no supervisada @5 61
C03 03  X  FRE  @0 Couverture @5 62
C03 03  X  ENG  @0 Coverage @5 62
C03 03  X  SPA  @0 Cobertura @5 62
C03 04  3  FRE  @0 Correction erreur @5 63
C03 04  3  ENG  @0 Error correction @5 63
C03 05  X  FRE  @0 Analyse documentaire @5 64
C03 05  X  ENG  @0 Document analysis @5 64
C03 05  X  SPA  @0 Análisis documental @5 64
C03 06  3  FRE  @0 Reconnaissance optique caractère @5 65
C03 06  3  ENG  @0 Optical character recognition @5 65
C03 07  3  FRE  @0 Poursuite cible @5 66
C03 07  3  ENG  @0 Target tracking @5 66
C03 08  3  FRE  @0 0130C @4 INC @5 83
C03 09  3  FRE  @0 4230 @4 INC @5 84
N21       @1 185
N44 01      @1 OTO
N82       @1 OTO
pR  
A30 01  1  ENG  @1 Electronic Imaging Science and Technology Symposium @3 San Francisco CA USA @4 2010
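
The field lines above follow a visible pattern: a three-character tag, some optional columns, then subfields introduced by `@` and a one-character code. A minimal sketch of a parser for this display layout is given below. It is based only on the lines shown here, not on the ISO 2709 specification or on any Inist tool, and the function name `parse_field` is hypothetical. It assumes that subfield values never contain an `@` character.

```python
import re

# One subfield: "@", a one-character code, whitespace, then text up to the next "@".
SUBFIELD = re.compile(r"@(\w)\s+([^@]*)")


def parse_field(line):
    """Split one display line into (tag, columns, subfields).

    `columns` holds whatever sits between the tag and the first subfield
    (occurrence numbers, indicators, language codes), left uninterpreted.
    Returns None for a blank line.
    """
    head, sep, rest = line.partition("@")
    parts = head.split()
    if not parts:
        return None
    tag, columns = parts[0], parts[1:]
    subfields = [(code, value.strip()) for code, value in SUBFIELD.findall(sep + rest)]
    return tag, columns, subfields


editor = parse_field("A12 01  1    @1 AGAM (Gady) @9 ed.")
volume = parse_field("A05       @2 7874")
```

Section markers such as `pA` and `pR` carry no subfields, so they come back with an empty subfield list.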

Inist format (server)

NO : PASCAL 11-0279160 INIST
ET : Unsupervised Method to Generate Page Templates
AU : DEJEAN (Herve); AGAM (Gady); VIARD-GAUDIN (Christian)
AF : Xerox Research Centre/Europe (1 aut.)
DT : Serial publication; Conference; Analytic level
SO : Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; United States; Da. 2011; Vol. 7874; 78740M.1-78740M.10; Bibl. 12 ref.
LA : English
EA : In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
CC : 001B00A30C; 001B40B30
FD : Imagerie; Classification non supervisée; Couverture; Correction erreur; Analyse documentaire; Reconnaissance optique caractère; Poursuite cible; 0130C; 4230
ED : Imagery; Unsupervised classification; Coverage; Error correction; Document analysis; Optical character recognition; Target tracking
SD : Imaginería; Clasificación no supervisada; Cobertura; Análisis documental
LO : INIST-21760.354000174732580200
ID : 11-0279160
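
The server format above is a list of `KEY : value` lines, with several fields holding `;`-separated lists. As a rough sketch, it could be loaded into a dictionary as follows. The function name `parse_server_record` and the choice of which keys to treat as lists are assumptions made for illustration, inferred from the lines shown here rather than from Inist documentation.

```python
def parse_server_record(text, multi_valued=("AU", "CC", "FD", "ED", "SD")):
    """Turn "KEY : value" lines into a dict; list-like fields are split on ";"."""
    record = {}
    for line in text.splitlines():
        # Only the first " : " separates key and value; later colons stay in the value.
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # not a "KEY : value" line
        key, value = key.strip(), value.strip()
        if key in multi_valued:
            record[key] = [item.strip() for item in value.split(";")]
        else:
            record[key] = value
    return record


sample = (
    "NO : PASCAL 11-0279160 INIST\n"
    "AU : DEJEAN (Herve); AGAM (Gady); VIARD-GAUDIN (Christian)\n"
    "ID : 11-0279160\n"
)
record = parse_server_record(sample)
```

The `SO` field is left as a single string here because its `;`-separated parts are heterogeneous (title, ISSN, volume, pages) rather than a list of like items.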

Links to Exploration step

Pascal:11-0279160

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author>
<name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">11-0279160</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0279160 INIST</idno>
<idno type="RBID">Pascal:11-0279160</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000133</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author>
<name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Coverage</term>
<term>Document analysis</term>
<term>Error correction</term>
<term>Imagery</term>
<term>Optical character recognition</term>
<term>Target tracking</term>
<term>Unsupervised classification</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Imagerie</term>
<term>Classification non supervisée</term>
<term>Couverture</term>
<term>Correction erreur</term>
<term>Analyse documentaire</term>
<term>Reconnaissance optique caractère</term>
<term>Poursuite cible</term>
<term>0130C</term>
<term>4230</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0277-786X</s0>
</fA01>
<fA02 i1="01">
<s0>PSISDG</s0>
</fA02>
<fA03 i2="1">
<s0>Proc. SPIE Int. Soc. Opt. Eng.</s0>
</fA03>
<fA05>
<s2>7874</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>Unsupervised Method to Generate Page Templates</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Document recognition and retrieval XVIII : 26-27 January 2011, San Francisco, California, United States</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>DEJEAN (Herve)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>AGAM (Gady)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>VIARD-GAUDIN (Christian)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>SPIE</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20>
<s2>78740M.1-78740M.10</s2>
</fA20>
<fA21>
<s1>2011</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA25 i1="01">
<s1>SPIE</s1>
<s2>Bellingham WA</s2>
</fA25>
<fA26 i1="01">
<s0>978-0-8194-8411-6</s0>
</fA26>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21760</s2>
<s5>354000174732580200</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2011 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>12 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>11-0279160</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Proceedings of SPIE, the International Society for Optical Engineering</s0>
</fA64>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</s0>
</fC01>
<fC02 i1="01" i2="3">
<s0>001B00A30C</s0>
</fC02>
<fC02 i1="02" i2="3">
<s0>001B40B30</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Imagerie</s0>
<s5>19</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Imagery</s0>
<s5>19</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Imaginería</s0>
<s5>19</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Classification non supervisée</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Unsupervised classification</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Clasificación no supervisada</s0>
<s5>61</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Couverture</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Coverage</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Cobertura</s0>
<s5>62</s5>
</fC03>
<fC03 i1="04" i2="3" l="FRE">
<s0>Correction erreur</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="3" l="ENG">
<s0>Error correction</s0>
<s5>63</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Analyse documentaire</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Document analysis</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Análisis documental</s0>
<s5>64</s5>
</fC03>
<fC03 i1="06" i2="3" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>65</s5>
</fC03>
<fC03 i1="06" i2="3" l="ENG">
<s0>Optical character recognition</s0>
<s5>65</s5>
</fC03>
<fC03 i1="07" i2="3" l="FRE">
<s0>Poursuite cible</s0>
<s5>66</s5>
</fC03>
<fC03 i1="07" i2="3" l="ENG">
<s0>Target tracking</s0>
<s5>66</s5>
</fC03>
<fC03 i1="08" i2="3" l="FRE">
<s0>0130C</s0>
<s4>INC</s4>
<s5>83</s5>
</fC03>
<fC03 i1="09" i2="3" l="FRE">
<s0>4230</s0>
<s4>INC</s4>
<s5>84</s5>
</fC03>
<fN21>
<s1>185</s1>
</fN21>
<fN44 i1="01">
<s1>OTO</s1>
</fN44>
<fN82>
<s1>OTO</s1>
</fN82>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Electronic Imaging Science and Technology Symposium</s1>
<s3>San Francisco CA USA</s3>
<s4>2010</s4>
</fA30>
</pR>
</standard>
<server>
<NO>PASCAL 11-0279160 INIST</NO>
<ET>Unsupervised Method to Generate Page Templates</ET>
<AU>DEJEAN (Herve); AGAM (Gady); VIARD-GAUDIN (Christian)</AU>
<AF>Xerox Research Centre/Europe (1 aut.)</AF>
<DT>Serial publication; Conference; Analytic level</DT>
<SO>Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; United States; Da. 2011; Vol. 7874; 78740M.1-78740M.10; Bibl. 12 ref.</SO>
<LA>English</LA>
<EA>In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists in performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</EA>
<CC>001B00A30C; 001B40B30</CC>
<FD>Imagerie; Classification non supervisée; Couverture; Correction erreur; Analyse documentaire; Reconnaissance optique caractère; Poursuite cible; 0130C; 4230</FD>
<ED>Imagery; Unsupervised classification; Coverage; Error correction; Document analysis; Optical character recognition; Target tracking</ED>
<SD>Imaginería; Clasificación no supervisada; Cobertura; Análisis documental</SD>
<LO>INIST-21760.354000174732580200</LO>
<ID>11-0279160</ID>
</server>
</inist>
</record>
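
As shown, the record uses an `inist:` element prefix without declaring it, so a strict XML parser would likely reject it with an unbound-prefix error. The sketch below works around this by binding the prefix to a placeholder namespace before parsing, then collects the `<keywords>` groups by language. The placeholder URI and the function name `keywords_by_language` are assumptions made for illustration. The real namespace, if one exists, is not given on this page.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined `xml:` prefix to this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def keywords_by_language(xml_text):
    """Map each xml:lang value to the list of <term> texts in its <keywords> group."""
    # Bind the undeclared `inist:` prefix to a placeholder URI (an assumption).
    patched = xml_text.replace(
        "<record>", '<record xmlns:inist="urn:example:inist">', 1
    )
    root = ET.fromstring(patched)
    result = {}
    for group in root.iter("keywords"):
        lang = group.get(XML_NS + "lang")
        result[lang] = [term.text for term in group.findall("term")]
    return result


sample_xml = (
    "<record><TEI><teiHeader>"
    '<author><affiliation><inist:fA14 i1="01"><s1>Xerox Research Centre</s1>'
    "</inist:fA14></affiliation></author>"
    '<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en">'
    "<term>Coverage</term><term>Imagery</term>"
    "</keywords></textClass></profileDesc>"
    "</teiHeader></TEI></record>"
)
keywords = keywords_by_language(sample_xml)
```

Patching the text keeps the sketch within the standard library. A pipeline that controls how the XML is produced would more naturally declare the namespace at the source.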

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000133 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000133 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:11-0279160
   |texte=   Unsupervised Method to Generate Page Templates
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024