OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Unsupervised Method to Generate Page Templates

Internal identifier: 000640 (PascalFrancis/Curation); previous: 000639; next: 000641

Authors: Herve Dejean

Source: Proceedings of SPIE, the International Society for Optical Engineering [0277-786X]; 2011.

RBID : Pascal:11-0279160

French descriptors

English descriptors

Abstract

In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists of performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation selects the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
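
The pipeline outlined in the abstract (labeled elements, geometric relations, frequent candidates, fuzzy matching) can be illustrated with a toy sketch. This is not the author's implementation: the page representation, the grid quantization, the Jaccard score, and all function names and thresholds below are assumptions chosen only to make the idea concrete.

```python
from collections import Counter

# Assumed input: each page is a dict mapping a logical label
# (e.g. "header") to the (x, y) position of that element.

def signature(page, grid=50):
    """Quantize positions so near-identical layouts share one signature."""
    return frozenset((label, round(x / grid), round(y / grid))
                     for label, (x, y) in page.items())

def candidate_templates(pages, min_support=2, grid=50):
    """Keep the signatures that occur on at least min_support pages."""
    counts = Counter(signature(p, grid) for p in pages)
    return [sig for sig, n in counts.most_common() if n >= min_support]

def fuzzy_match(page, templates, threshold=0.5, grid=50):
    """Return the template with the best Jaccard overlap, or None."""
    sig = signature(page, grid)
    best, best_score = None, 0.0
    for tpl in templates:
        union = sig | tpl
        if not union:
            continue  # two empty layouts carry no evidence
        score = len(sig & tpl) / len(union)
        if score > best_score:
            best, best_score = tpl, score
    return best if best_score >= threshold else None
```

A page that matches a frequent template except for one extra or missing element still clears the threshold, which is the kind of tolerance a fuzzy matching step needs before a template can be used to flag zoning or labeling errors.
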
pA  
A01 01  1    @0 0277-786X
A02 01      @0 PSISDG
A03   1    @0 Proc. SPIE Int. Soc. Opt. Eng.
A05       @2 7874
A08 01  1  ENG  @1 Unsupervised Method to Generate Page Templates
A09 01  1  ENG  @1 DOcument recognition and retrieval XVIII : 26-27 January 2011, San Francisco, California, United States
A11 01  1    @1 DEJEAN (Herve)
A12 01  1    @1 AGAM (Gady) @9 ed.
A12 02  1    @1 VIARD-GAUDIN (Christian) @9 ed.
A14 01      @1 Xerox Research Centre @3 EUR @Z 1 aut.
A18 01  1    @1 SPIE @3 USA @9 org-cong.
A20       @2 78740M.1-78740M.10
A21       @1 2011
A23 01      @0 ENG
A25 01      @1 SPIE @2 Bellingham WA
A26 01      @0 978-0-8194-8411-6
A43 01      @1 INIST @2 21760 @5 354000174732580200
A44       @0 0000 @1 © 2011 INIST-CNRS. All rights reserved.
A45       @0 12 ref.
A47 01  1    @0 11-0279160
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 Proceedings of SPIE, the International Society for Optical Engineering
A66 01      @0 USA
C01 01    ENG  @0 In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
C02 01  3    @0 001B00A30C
C02 02  3    @0 001B40B30
C03 01  X  FRE  @0 Imagerie @5 19
C03 01  X  ENG  @0 Imagery @5 19
C03 01  X  SPA  @0 Imaginería @5 19
C03 02  X  FRE  @0 Classification non supervisée @5 61
C03 02  X  ENG  @0 Unsupervised classification @5 61
C03 02  X  SPA  @0 Clasificación no supervisada @5 61
C03 03  X  FRE  @0 Couverture @5 62
C03 03  X  ENG  @0 Coverage @5 62
C03 03  X  SPA  @0 Cobertura @5 62
C03 04  3  FRE  @0 Correction erreur @5 63
C03 04  3  ENG  @0 Error correction @5 63
C03 05  X  FRE  @0 Analyse documentaire @5 64
C03 05  X  ENG  @0 Document analysis @5 64
C03 05  X  SPA  @0 Análisis documental @5 64
C03 06  3  FRE  @0 Reconnaissance optique caractère @5 65
C03 06  3  ENG  @0 Optical character recognition @5 65
C03 07  3  FRE  @0 Poursuite cible @5 66
C03 07  3  ENG  @0 Target tracking @5 66
C03 08  3  FRE  @0 0130C @4 INC @5 83
C03 09  3  FRE  @0 4230 @4 INC @5 84
N21       @1 185
N44 01      @1 OTO
N82       @1 OTO
pR  
A30 01  1  ENG  @1 Electronic Imaging Science and Technology Symposium @3 San Francisco CA USA @4 2010
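
Each line of the flat listing above appears to carry a field tag, optional indicators, and a run of `@`-prefixed subfields, mirroring the `fAxx`/`sN` elements of the XML record further down. A minimal sketch of a line splitter, assuming that reading of the layout; the function name and the terms "indicators" and "subfields" are labels chosen here, not INIST terminology.

```python
import re

def parse_field(line):
    """Split one flat field line into (tag, indicators, subfields)."""
    # The capturing group keeps each subfield code in the split result.
    head, *rest = re.split(r"\s@(\w)\s", line.strip())
    tag, *indicators = head.split()
    # rest alternates: subfield code, subfield value, code, value, ...
    subfields = [(code, value.strip())
                 for code, value in zip(rest[0::2], rest[1::2])]
    return tag, indicators, subfields

tag, indicators, subfields = parse_field(
    "A14 01      @1 Xerox Research Centre @3 EUR @Z 1 aut.")
```

The splitter treats any whitespace-delimited `@` followed by a single character as a subfield marker, so it would misparse a value that happens to contain such a sequence.
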


Links to Exploration step

Pascal:11-0279160

Curation

No country items

Herve Dejean
<affiliation>
<inist:fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<wicri:noCountry>EUR</wicri:noCountry>
</affiliation>

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author>
<name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<wicri:noCountry>EUR</wicri:noCountry>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">11-0279160</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0279160 INIST</idno>
<idno type="RBID">Pascal:11-0279160</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000133</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000640</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author>
<name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<wicri:noCountry>EUR</wicri:noCountry>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Coverage</term>
<term>Document analysis</term>
<term>Error correction</term>
<term>Imagery</term>
<term>Optical character recognition</term>
<term>Target tracking</term>
<term>Unsupervised classification</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Imagerie</term>
<term>Classification non supervisée</term>
<term>Couverture</term>
<term>Correction erreur</term>
<term>Analyse documentaire</term>
<term>Reconnaissance optique caractère</term>
<term>Poursuite cible</term>
<term>0130C</term>
<term>4230</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0277-786X</s0>
</fA01>
<fA02 i1="01">
<s0>PSISDG</s0>
</fA02>
<fA03 i2="1">
<s0>Proc. SPIE Int. Soc. Opt. Eng.</s0>
</fA03>
<fA05>
<s2>7874</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>Unsupervised Method to Generate Page Templates</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>DOcument recognition and retrieval XVIII : 26-27 January 2011, San Francisco, California, United States</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>DEJEAN (Herve)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>AGAM (Gady)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>VIARD-GAUDIN (Christian)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>SPIE</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20>
<s2>78740M.1-78740M.10</s2>
</fA20>
<fA21>
<s1>2011</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA25 i1="01">
<s1>SPIE</s1>
<s2>Bellingham WA</s2>
</fA25>
<fA26 i1="01">
<s0>978-0-8194-8411-6</s0>
</fA26>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21760</s2>
<s5>354000174732580200</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2011 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>12 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>11-0279160</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Proceedings of SPIE, the International Society for Optical Engineering</s0>
</fA64>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</s0>
</fC01>
<fC02 i1="01" i2="3">
<s0>001B00A30C</s0>
</fC02>
<fC02 i1="02" i2="3">
<s0>001B40B30</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Imagerie</s0>
<s5>19</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Imagery</s0>
<s5>19</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Imaginería</s0>
<s5>19</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Classification non supervisée</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Unsupervised classification</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Clasificación no supervisada</s0>
<s5>61</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Couverture</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Coverage</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Cobertura</s0>
<s5>62</s5>
</fC03>
<fC03 i1="04" i2="3" l="FRE">
<s0>Correction erreur</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="3" l="ENG">
<s0>Error correction</s0>
<s5>63</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Analyse documentaire</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Document analysis</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Análisis documental</s0>
<s5>64</s5>
</fC03>
<fC03 i1="06" i2="3" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>65</s5>
</fC03>
<fC03 i1="06" i2="3" l="ENG">
<s0>Optical character recognition</s0>
<s5>65</s5>
</fC03>
<fC03 i1="07" i2="3" l="FRE">
<s0>Poursuite cible</s0>
<s5>66</s5>
</fC03>
<fC03 i1="07" i2="3" l="ENG">
<s0>Target tracking</s0>
<s5>66</s5>
</fC03>
<fC03 i1="08" i2="3" l="FRE">
<s0>0130C</s0>
<s4>INC</s4>
<s5>83</s5>
</fC03>
<fC03 i1="09" i2="3" l="FRE">
<s0>4230</s0>
<s4>INC</s4>
<s5>84</s5>
</fC03>
<fN21>
<s1>185</s1>
</fN21>
<fN44 i1="01">
<s1>OTO</s1>
</fN44>
<fN82>
<s1>OTO</s1>
</fN82>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Electronic Imaging Science and Technology Symposium</s1>
<s3>San Francisco CA USA</s3>
<s4>2010</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
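
As displayed here, the record uses the `inist:` and `wicri:` prefixes without declaring them, so a namespace-aware parser would reject it as is. The sketch below reads an excerpt of the record with Python's standard library by wrapping it in an element that binds both prefixes. The wrapper element and the two `urn:example:` URIs are placeholders invented for this example, not the real namespace identifiers.

```python
import xml.etree.ElementTree as ET

# Excerpt copied from the record above.
EXCERPT = """
<titleStmt>
<title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author>
<name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<wicri:noCountry>EUR</wicri:noCountry>
</affiliation>
</author>
</titleStmt>
"""

# Placeholder namespace URIs (assumption): they only make the prefixes legal.
NS = {"inist": "urn:example:inist", "wicri": "urn:example:wicri"}
WRAPPER = ('<wrapper xmlns:inist="urn:example:inist" '
           'xmlns:wicri="urn:example:wicri">{}</wrapper>')

root = ET.fromstring(WRAPPER.format(EXCERPT))
title = root.findtext(".//title")
author = root.findtext(".//author/name")
# s1 carries no prefix, so only fA14 needs the namespace mapping.
organisation = root.findtext(".//inist:fA14/s1", namespaces=NS)
```

If the real namespace URIs are available from the Dilib distribution, they should replace the placeholders so that the parsed tree matches other tools that read the same records.
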

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000640 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000640 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Curation
   |type=    RBID
   |clé=     Pascal:11-0279160
   |texte=   Unsupervised Method to Generate Page Templates
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024