
Unsupervised Method to Generate Page Templates

Internal identifier: 000133 (PascalFrancis/Corpus); previous: 000132; next: 000134

Authors: Herve Dejean

Source:

Proceedings of SPIE, the International Society for Optical Engineering [ 0277-786X ] ; 2011.

RBID : Pascal:11-0279160

French descriptors

Pascal (Inist)
- Imagerie, Classification non supervisée, Couverture, Correction erreur, Analyse documentaire, Reconnaissance optique caractère, Poursuite cible, 0130C, 4230.

English descriptors

KwdEn :
- Coverage, Document analysis, Error correction, Imagery, Optical character recognition, Target tracking, Unsupervised classification.

Abstract

In this paper, we propose a method for automatically inferring the different page templates used to lay out the document content. The first step of the method consists of performing a logical analysis of the document. Depending on the coverage of this step, a given number of document elements will be labeled. Geometric relations are then computed between these labeled elements, and page template candidates are generated using frequent related elements. A fuzzy matching operation selects the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the earlier steps of the document analysis: zoning, OCR, and logical analysis. The method was evaluated using the INEX book track collection.
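
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the input shape (a list of `(label, x, y)` tuples per page), the coarse relation vocabulary, the `min_support` cutoff, and the Jaccard-based `fuzzy_match` are all assumptions chosen only to make the three stages (geometric relations, frequent-element candidates, fuzzy matching) concrete.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each page is a list of (label, x, y) tuples for the
# elements that the logical analysis managed to label.

def relations(page, tol=2.0):
    """Return coarse geometric relations between every pair of labeled elements."""
    rels = set()
    for (la, xa, ya), (lb, xb, yb) in combinations(sorted(page), 2):
        if abs(xa - xb) <= tol:
            horiz = "aligned_x"
        else:
            horiz = "left_of" if xa < xb else "right_of"
        if abs(ya - yb) <= tol:
            vert = "aligned_y"
        else:
            vert = "above" if ya < yb else "below"
        rels.add((la, lb, horiz, vert))
    return frozenset(rels)

def candidate_templates(pages, min_support=2):
    """Build candidates from relations seen on at least min_support pages."""
    per_page = [relations(p) for p in pages]
    freq = Counter(r for rels in per_page for r in rels)
    frequent = {r for r, n in freq.items() if n >= min_support}
    # A candidate is the set of frequent relations found on a page;
    # the Counter records how many pages share each candidate.
    candidates = Counter(rels & frequent for rels in per_page)
    candidates.pop(frozenset(), None)
    return candidates

def fuzzy_match(page, template, threshold=0.5):
    """Assumed matching rule: Jaccard overlap of relation sets above a threshold."""
    rels = relations(page)
    union = rels | template
    score = len(rels & template) / len(union) if union else 0.0
    return score >= threshold
```

Under these assumptions, a page that fails to match any frequent template is a candidate for re-examination by the earlier zoning, OCR, or logical analysis steps.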

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

A01	`01`	`1`		`@0 0277-786X`
A02	`01`			`@0 PSISDG`
A03		`1`		`@0 Proc. SPIE Int. Soc. Opt. Eng.`
A05				`@2 7874`
A08	`01`	`1`	`ENG`	`@1 Unsupervised Method to Generate Page Templates`
A09	`01`	`1`	`ENG`	`@1 Document recognition and retrieval XVIII : 26-27 January 2011, San Francisco, California, United States`
A11	`01`	`1`		`@1 DEJEAN (Herve)`
A12	`01`	`1`		`@1 AGAM (Gady) @9 ed.`
A12	`02`	`1`		`@1 VIARD-GAUDIN (Christian) @9 ed.`
A14	`01`			`@1 Xerox Research Centre @3 EUR @Z 1 aut.`
A18	`01`	`1`		`@1 SPIE @3 USA @9 org-cong.`
A20				`@2 78740M.1-78740M.10`
A21				`@1 2011`
A23	`01`			`@0 ENG`
A25	`01`			`@1 SPIE @2 Bellingham WA`
A26	`01`			`@0 978-0-8194-8411-6`
A43	`01`			`@1 INIST @2 21760 @5 354000174732580200`
A44				`@0 0000 @1 © 2011 INIST-CNRS. All rights reserved.`
A45				`@0 12 ref.`
A47	`01`	`1`		`@0 11-0279160`
A60				`@1 P @2 C`
A61				`@0 A`
A64	`01`	`1`		`@0 Proceedings of SPIE, the International Society for Optical Engineering`
A66	`01`			`@0 USA`
C01	`01`		`ENG`	`@0 In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.`
C02	`01`	`3`		`@0 001B00A30C`
C02	`02`	`3`		`@0 001B40B30`
C03	`01`	`X`	`FRE`	`@0 Imagerie @5 19`
C03	`01`	`X`	`ENG`	`@0 Imagery @5 19`
C03	`01`	`X`	`SPA`	`@0 Imaginería @5 19`
C03	`02`	`X`	`FRE`	`@0 Classification non supervisée @5 61`
C03	`02`	`X`	`ENG`	`@0 Unsupervised classification @5 61`
C03	`02`	`X`	`SPA`	`@0 Clasificación no supervisada @5 61`
C03	`03`	`X`	`FRE`	`@0 Couverture @5 62`
C03	`03`	`X`	`ENG`	`@0 Coverage @5 62`
C03	`03`	`X`	`SPA`	`@0 Cobertura @5 62`
C03	`04`	`3`	`FRE`	`@0 Correction erreur @5 63`
C03	`04`	`3`	`ENG`	`@0 Error correction @5 63`
C03	`05`	`X`	`FRE`	`@0 Analyse documentaire @5 64`
C03	`05`	`X`	`ENG`	`@0 Document analysis @5 64`
C03	`05`	`X`	`SPA`	`@0 Análisis documental @5 64`
C03	`06`	`3`	`FRE`	`@0 Reconnaissance optique caractère @5 65`
C03	`06`	`3`	`ENG`	`@0 Optical character recognition @5 65`
C03	`07`	`3`	`FRE`	`@0 Poursuite cible @5 66`
C03	`07`	`3`	`ENG`	`@0 Target tracking @5 66`
C03	`08`	`3`	`FRE`	`@0 0130C @4 INC @5 83`
C03	`09`	`3`	`FRE`	`@0 4230 @4 INC @5 84`
N21				`@1 185`
N44	`01`			`@1 OTO`
N82				`@1 OTO`

A30	`01`	`1`	`ENG`	`@1 Electronic Imaging Science and Technology Symposium @3 San Francisco CA USA @4 2010`
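
The rows above follow a regular shape: a field tag, up to three indicator columns, and a content cell made of `@`-prefixed subfields. A minimal Python sketch of how such a row could be split is given below. The function name `parse_row` and the returned dictionary layout are illustrative assumptions, not part of Dilib or of the Inist format definition, and the sketch assumes the tab-separated, backtick-quoted layout shown on this page.

```python
import re

# One subfield: "@" + one-character code, then text up to the next subfield.
SUBFIELD = re.compile(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)")

def parse_row(row):
    """Split one tab-separated row into tag, indicators, and subfields."""
    cells = [c.strip().strip("`") for c in row.split("\t")]
    tag, content = cells[0], cells[-1]
    indicators = cells[1:-1]  # empty strings mark unused indicator columns
    subfields = SUBFIELD.findall(content)
    return {"tag": tag, "indicators": indicators, "subfields": subfields}

row = "A12\t`01`\t`1`\t\t`@1 AGAM (Gady) @9 ed.`"
print(parse_row(row))
```

Keeping subfields as an ordered list of `(code, value)` pairs, rather than a dictionary, preserves their order in the record.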

Inist format (server)

NO :	PASCAL 11-0279160 INIST
ET :	Unsupervised Method to Generate Page Templates
AU :	DEJEAN (Herve); AGAM (Gady); VIARD-GAUDIN (Christian)
AF :	Xerox Research Centre/Europe (1 aut.)
DT :	Publication en série; Congrès; Niveau analytique
SO :	Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; Etats-Unis; Da. 2011; Vol. 7874; 78740M.1-78740M.10; Bibl. 12 ref.
LA :	Anglais
EA :	In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.
CC :	001B00A30C; 001B40B30
FD :	Imagerie; Classification non supervisée; Couverture; Correction erreur; Analyse documentaire; Reconnaissance optique caractère; Poursuite cible; 0130C; 4230
ED :	Imagery; Unsupervised classification; Coverage; Error correction; Document analysis; Optical character recognition; Target tracking
SD :	Imaginería; Clasificación no supervisada; Cobertura; Análisis documental
LO :	INIST-21760.354000174732580200
ID :	11-0279160
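
Each line of the server format is a two-letter tag, a colon, and a value, with `;` separating the items of multi-valued fields such as `AU` or `ED`. The following Python sketch reads such a block into a dictionary. The helper names `parse_server_record` and `split_values` are hypothetical, and the sketch assumes one field per line, as displayed above.

```python
import re

# A field line: two uppercase letters, optional whitespace, a colon, the value.
FIELD = re.compile(r"^([A-Z]{2})\s*:\s*(.*)$")

def parse_server_record(text):
    """Map each two-letter tag to its raw string value."""
    record = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            record[m.group(1)] = m.group(2).strip()
    return record

def split_values(value):
    """Split a multi-valued field on ';' and drop surrounding whitespace."""
    return [v.strip() for v in value.split(";") if v.strip()]

sample = (
    "NO :\tPASCAL 11-0279160 INIST\n"
    "AU :\tDEJEAN (Herve); AGAM (Gady); VIARD-GAUDIN (Christian)\n"
)
rec = parse_server_record(sample)
print(split_values(rec["AU"]))
```

Splitting is left as a separate step because some fields, such as `SO`, use `;` inside a single composite value rather than as a list separator.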

Links to Exploration step

Pascal:11-0279160

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author><name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation><inist:fA14 i1="01"><s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">11-0279160</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0279160 INIST</idno>
<idno type="RBID">Pascal:11-0279160</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000133</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Unsupervised Method to Generate Page Templates</title>
<author><name sortKey="Dejean, Herve" sort="Dejean, Herve" uniqKey="Dejean H" first="Herve" last="Dejean">Herve Dejean</name>
<affiliation><inist:fA14 i1="01"><s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Coverage</term>
<term>Document analysis</term>
<term>Error correction</term>
<term>Imagery</term>
<term>Optical character recognition</term>
<term>Target tracking</term>
<term>Unsupervised classification</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Imagerie</term>
<term>Classification non supervisée</term>
<term>Couverture</term>
<term>Correction erreur</term>
<term>Analyse documentaire</term>
<term>Reconnaissance optique caractère</term>
<term>Poursuite cible</term>
<term>0130C</term>
<term>4230</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>0277-786X</s0>
</fA01>
<fA02 i1="01"><s0>PSISDG</s0>
</fA02>
<fA03 i2="1"><s0>Proc. SPIE Int. Soc. Opt. Eng.</s0>
</fA03>
<fA05><s2>7874</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG"><s1>Unsupervised Method to Generate Page Templates</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>DOcument recognition and retrieval XVIII : 26-27 January 2011, San Francisco, California, United States</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>DEJEAN (Herve)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>AGAM (Gady)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>VIARD-GAUDIN (Christian)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>Xerox Research Centre</s1>
<s3>EUR</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1"><s1>SPIE</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20><s2>78740M.1-78740M.10</s2>
</fA20>
<fA21><s1>2011</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA25 i1="01"><s1>SPIE</s1>
<s2>Bellingham WA</s2>
</fA25>
<fA26 i1="01"><s0>978-0-8194-8411-6</s0>
</fA26>
<fA43 i1="01"><s1>INIST</s1>
<s2>21760</s2>
<s5>354000174732580200</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2011 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>12 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>11-0279160</s0>
</fA47>
<fA60><s1>P</s1>
<s2>C</s2>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA64 i1="01" i2="1"><s0>Proceedings of SPIE, the International Society for Optical Engineering</s0>
</fA64>
<fA66 i1="01"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</s0>
</fC01>
<fC02 i1="01" i2="3"><s0>001B00A30C</s0>
</fC02>
<fC02 i1="02" i2="3"><s0>001B40B30</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Imagerie</s0>
<s5>19</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Imagery</s0>
<s5>19</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Imaginería</s0>
<s5>19</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Classification non supervisée</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Unsupervised classification</s0>
<s5>61</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Clasificación no supervisada</s0>
<s5>61</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Couverture</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Coverage</s0>
<s5>62</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Cobertura</s0>
<s5>62</s5>
</fC03>
<fC03 i1="04" i2="3" l="FRE"><s0>Correction erreur</s0>
<s5>63</s5>
</fC03>
<fC03 i1="04" i2="3" l="ENG"><s0>Error correction</s0>
<s5>63</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Analyse documentaire</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Document analysis</s0>
<s5>64</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Análisis documental</s0>
<s5>64</s5>
</fC03>
<fC03 i1="06" i2="3" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>65</s5>
</fC03>
<fC03 i1="06" i2="3" l="ENG"><s0>Optical character recognition</s0>
<s5>65</s5>
</fC03>
<fC03 i1="07" i2="3" l="FRE"><s0>Poursuite cible</s0>
<s5>66</s5>
</fC03>
<fC03 i1="07" i2="3" l="ENG"><s0>Target tracking</s0>
<s5>66</s5>
</fC03>
<fC03 i1="08" i2="3" l="FRE"><s0>0130C</s0>
<s4>INC</s4>
<s5>83</s5>
</fC03>
<fC03 i1="09" i2="3" l="FRE"><s0>4230</s0>
<s4>INC</s4>
<s5>84</s5>
</fC03>
<fN21><s1>185</s1>
</fN21>
<fN44 i1="01"><s1>OTO</s1>
</fN44>
<fN82><s1>OTO</s1>
</fN82>
</pA>
<pR><fA30 i1="01" i2="1" l="ENG"><s1>Electronic Imaging Science and Technology Symposium</s1>
<s3>San Francisco CA USA</s3>
<s4>2010</s4>
</fA30>
</pR>
</standard>
<server><NO>PASCAL 11-0279160 INIST</NO>
<ET>Unsupervised Method to Generate Page Templates</ET>
<AU>DEJEAN (Herve); AGAM (Gady); VIARD-GAUDIN (Christian)</AU>
<AF>Xerox Research Centre/Europe (1 aut.)</AF>
<DT>Publication en série; Congrès; Niveau analytique</DT>
<SO>Proceedings of SPIE, the International Society for Optical Engineering; ISSN 0277-786X; Coden PSISDG; Etats-Unis; Da. 2011; Vol. 7874; 78740M.1-78740M.10; Bibl. 12 ref.</SO>
<LA>Anglais</LA>
<EA>In this paper, we propose a method for automatically inferring the different page templates used to layout the document content. The first step of the method consists in performing a logical analysis of the document. Depending of the coverage of this step, a given number of document elements will be labeled. Then geometric relations are computed between these labeled elements, and page templates candidates are generated using frequent related elements. A fuzzy matching operation allows for selecting the most frequent and relevant page templates for a given document. Such page templates can be used to correct errors produced during the different previous steps of the document analysis: zoning, OCR, and logical analysis. Evaluation has been performed using the INEX book track collection.</EA>
<CC>001B00A30C; 001B40B30</CC>
<FD>Imagerie; Classification non supervisée; Couverture; Correction erreur; Analyse documentaire; Reconnaissance optique caractère; Poursuite cible; 0130C; 4230</FD>
<ED>Imagery; Unsupervised classification; Coverage; Error correction; Document analysis; Optical character recognition; Target tracking</ED>
<SD>Imaginería; Clasificación no supervisada; Cobertura; Análisis documental</SD>
<LO>INIST-21760.354000174732580200</LO>
<ID>11-0279160</ID>
</server>
</inist>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000133 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000133 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:11-0279160
   |texte=   Unsupervised Method to Generate Page Templates
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

