OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Using the Web to validate document recognition results : Experiments with business cards

Internal identifier: 000369 (PascalFrancis/Checkpoint); previous: 000368; next: 000370

Authors: Clemens Oertel [Germany]; Shauna O'Shea [Canada]; Adam Bodnar [Canada]; Dorothea Blostein [Canada]

Source: SPIE proceedings series (ISSN 1017-2653), vol. 5676, pp. 17-27, 2005

RBID : Pascal:05-0360301

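French descriptors: Réseau web; Correction erreur; Reconnaissance optique caractère; Validation; Recherche information

English descriptors: Error correction; Information retrieval; Optical character recognition; Validation; World wide web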

Abstract

The World Wide Web is a vast information resource which can be useful for validating the results produced by document recognizers. Three computational steps are involved, all of them challenging: (1) use the recognition results in a Web search to retrieve Web pages that contain information similar to that in the document, (2) identify the relevant portions of the retrieved Web pages, and (3) analyze these relevant portions to determine what corrections (if any) should be made to the recognition result. We have conducted exploratory implementations of steps (1) and (2) in the business-card domain: we use fields of the business card to retrieve Web pages and identify the most relevant portions of those Web pages. In some cases, this information appears suitable for correcting OCR errors in the business card fields. In other cases, the approach fails due to stale information: when business cards are several years old and the business-card holder has changed jobs, then websites (such as the home page or company website) no longer contain information matching that on the business card. Our exploratory results indicate that in some domains it may be possible to develop effective means of querying the Web with recognition results, and to use this information to correct the recognition results and/or detect that the information is stale.
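The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Web is replaced by a small in-memory dictionary (`PAGES`, with invented example.org content), and the helper names `search`, `relevant_lines`, and `suggest_corrections`, the token-overlap scoring, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
import re

# Hypothetical stand-in for the Web: URL -> page text (invented content).
PAGES = {
    "https://example.org/staff": (
        "Staff directory\n"
        "Jane Doe, Senior Engineer\n"
        "Phone: 555-0100\n"
        "Email: jane.doe@example.org"
    ),
    "https://example.org/news": "Company news\nOur picnic is in June",
}


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(fields, pages=PAGES):
    """Step 1: rank pages by the number of tokens they share with the card fields."""
    query = tokens(" ".join(fields.values()))
    scored = [(len(query & tokens(text)), url) for url, text in pages.items()]
    return [url for score, url in sorted(scored, reverse=True) if score > 0]


def relevant_lines(page_text, fields):
    """Step 2: keep only the lines that share a token with some card field."""
    query = tokens(" ".join(fields.values()))
    return [line for line in page_text.splitlines() if query & tokens(line)]


def suggest_corrections(fields, lines, cutoff=0.8):
    """Step 3: propose a replacement when a page fragment nearly matches a field."""
    # Split lines on commas and colons to get short candidate fragments.
    candidates = [part.strip() for line in lines for part in re.split(r"[,:]", line)]
    corrections = {}
    for name, value in fields.items():
        best = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
        if best and best[0] != value:
            corrections[name] = best[0]
    return corrections


if __name__ == "__main__":
    # "D0e" simulates an OCR error (digit zero instead of the letter o).
    card = {"name": "Jane D0e", "phone": "555-0100"}
    hits = search(card)
    lines = relevant_lines(PAGES[hits[0]], card)
    print(suggest_corrections(card, lines))
```

With the sample card, the sketch proposes "Jane Doe" for the name field and leaves the phone number unchanged, since it already matches the page exactly. An empty result from `search` could be treated as one possible hint that the card information is stale, although the sketch does not try to distinguish that case from a poor query.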


Affiliations: Germany (Clemens Oertel); Canada (Shauna O'Shea, Adam Bodnar, Dorothea Blostein)


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Using the Web to validate document recognition results : Experiments with business cards</title>
<author>
<name sortKey="Oertel, Clemens" sort="Oertel, Clemens" uniqKey="Oertel C" first="Clemens" last="Oertel">Clemens Oertel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Bioinformatics, Computer and Cognitive Science University of Tübingen</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<wicri:noRegion>Computer and Cognitive Science University of Tübingen</wicri:noRegion>
<wicri:noRegion>Bioinformatics, Computer and Cognitive Science University of Tübingen</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="O Shea, Shauna" sort="O Shea, Shauna" uniqKey="O Shea S" first="Shauna" last="O'Shea">Shauna O'Shea</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>School of Computing Queen's University</s1>
<s2>Kingston, Ontario</s2>
<s3>CAN</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>School of Computing Queen's University</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Bodnar, Adam" sort="Bodnar, Adam" uniqKey="Bodnar A" first="Adam" last="Bodnar">Adam Bodnar</name>
<affiliation wicri:level="1">
<inist:fA14 i1="03">
<s1>Dept. Computer Science University of British Columbia</s1>
<s2>Vancouver, BC</s2>
<s3>CAN</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>Dept. Computer Science University of British Columbia</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>School of Computing Queen's University</s1>
<s2>Kingston, Ontario</s2>
<s3>CAN</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>School of Computing Queen's University</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">05-0360301</idno>
<date when="2005">2005</date>
<idno type="stanalyst">PASCAL 05-0360301 INIST</idno>
<idno type="RBID">Pascal:05-0360301</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000460</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000328</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000369</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Using the Web to validate document recognition results : Experiments with business cards</title>
<author>
<name sortKey="Oertel, Clemens" sort="Oertel, Clemens" uniqKey="Oertel C" first="Clemens" last="Oertel">Clemens Oertel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Bioinformatics, Computer and Cognitive Science University of Tübingen</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Allemagne</country>
<wicri:noRegion>Computer and Cognitive Science University of Tübingen</wicri:noRegion>
<wicri:noRegion>Bioinformatics, Computer and Cognitive Science University of Tübingen</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="O Shea, Shauna" sort="O Shea, Shauna" uniqKey="O Shea S" first="Shauna" last="O'Shea">Shauna O'Shea</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>School of Computing Queen's University</s1>
<s2>Kingston, Ontario</s2>
<s3>CAN</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>School of Computing Queen's University</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Bodnar, Adam" sort="Bodnar, Adam" uniqKey="Bodnar A" first="Adam" last="Bodnar">Adam Bodnar</name>
<affiliation wicri:level="1">
<inist:fA14 i1="03">
<s1>Dept. Computer Science University of British Columbia</s1>
<s2>Vancouver, BC</s2>
<s3>CAN</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>Dept. Computer Science University of British Columbia</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>School of Computing Queen's University</s1>
<s2>Kingston, Ontario</s2>
<s3>CAN</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>School of Computing Queen's University</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="2005">2005</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Error correction</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Validation</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Réseau web</term>
<term>Correction erreur</term>
<term>Reconnaissance optique caractère</term>
<term>Validation</term>
<term>Recherche information</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The World Wide Web is a vast information resource which can be useful for validating the results produced by document recognizers. Three computational steps are involved, all of them challenging: (1) use the recognition results in a Web search to retrieve Web pages that contain information similar to that in the document, (2) identify the relevant portions of the retrieved Web pages, and (3) analyze these relevant portions to determine what corrections (if any) should be made to the recognition result. We have conducted exploratory implementations of steps (1) and (2) in the business-card domain: we use fields of the business card to retrieve Web pages and identify the most relevant portions of those Web pages. In some cases, this information appears suitable for correcting OCR errors in the business card fields. In other cases, the approach fails due to stale information: when business cards are several years old and the business-card holder has changed jobs, then websites (such as the home page or company website) no longer contain information matching that on the business card. Our exploratory results indicate that in some domains it may be possible to develop effective means of querying the Web with recognition results, and to use this information to correct the recognition results and/or detect that the information is stale.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>1017-2653</s0>
</fA01>
<fA05>
<s2>5676</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>Using the Web to validate document recognition results : Experiments with business cards</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Document recognition and retrieval XII : San Jose CA, 19-20 January 2005</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>OERTEL (Clemens)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>O'SHEA (Shauna)</s1>
</fA11>
<fA11 i1="03" i2="1">
<s1>BODNAR (Adam)</s1>
</fA11>
<fA11 i1="04" i2="1">
<s1>BLOSTEIN (Dorothea)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>SMITH (Elisa H. Barney)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>TAGHVA (Kazem)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>Bioinformatics, Computer and Cognitive Science University of Tübingen</s1>
<s3>DEU</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02">
<s1>School of Computing Queen's University</s1>
<s2>Kingston, Ontario</s2>
<s3>CAN</s3>
<sZ>2 aut.</sZ>
<sZ>4 aut.</sZ>
</fA14>
<fA14 i1="03">
<s1>Dept. Computer Science University of British Columbia</s1>
<s2>Vancouver, BC</s2>
<s3>CAN</s3>
<sZ>3 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>International Society for Optical Engineering</s1>
<s2>Bellingham WA</s2>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20>
<s1>17-27</s1>
</fA20>
<fA21>
<s1>2005</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA26 i1="01">
<s0>0-8194-5649-7</s0>
</fA26>
<fA43 i1="01">
<s1>INIST</s1>
<s2>21760</s2>
<s5>354000124499720030</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2005 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>21 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>05-0360301</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>SPIE proceedings series</s0>
</fA64>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>The World Wide Web is a vast information resource which can be useful for validating the results produced by document recognizers. Three computational steps are involved, all of them challenging: (1) use the recognition results in a Web search to retrieve Web pages that contain information similar to that in the document, (2) identify the relevant portions of the retrieved Web pages, and (3) analyze these relevant portions to determine what corrections (if any) should be made to the recognition result. We have conducted exploratory implementations of steps (1) and (2) in the business-card domain: we use fields of the business card to retrieve Web pages and identify the most relevant portions of those Web pages. In some cases, this information appears suitable for correcting OCR errors in the business card fields. In other cases, the approach fails due to stale information: when business cards are several years old and the business-card holder has changed jobs, then websites (such as the home page or company website) no longer contain information matching that on the business card. Our exploratory results indicate that in some domains it may be possible to develop effective means of querying the Web with recognition results, and to use this information to correct the recognition results and/or detect that the information is stale.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001D04A05A</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Réseau web</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>World wide web</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Red WWW</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Correction erreur</s0>
<s5>03</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Error correction</s0>
<s5>03</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Corrección error</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>04</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>04</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Reconocimiento óptico de caracteres</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Validation</s0>
<s5>05</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Validation</s0>
<s5>05</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Validación</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Recherche information</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Information retrieval</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Búsqueda información</s0>
<s5>06</s5>
</fC03>
<fN21>
<s1>248</s1>
</fN21>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Document recognition and retrieval. Conference</s1>
<s2>12</s2>
<s3>San Jose CA USA</s3>
<s4>2005-01-19</s4>
</fA30>
</pR>
</standard>
</inist>
<affiliations>
<list>
<country>
<li>Allemagne</li>
<li>Canada</li>
</country>
</list>
<tree>
<country name="Allemagne">
<noRegion>
<name sortKey="Oertel, Clemens" sort="Oertel, Clemens" uniqKey="Oertel C" first="Clemens" last="Oertel">Clemens Oertel</name>
</noRegion>
</country>
<country name="Canada">
<noRegion>
<name sortKey="O Shea, Shauna" sort="O Shea, Shauna" uniqKey="O Shea S" first="Shauna" last="O'Shea">Shauna O'Shea</name>
</noRegion>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
<name sortKey="Bodnar, Adam" sort="Bodnar, Adam" uniqKey="Bodnar A" first="Adam" last="Bodnar">Adam Bodnar</name>
</country>
</tree>
</affiliations>
</record>
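As a hedged illustration of how the record might be read programmatically, the sketch below parses a copy of the `<textClass>` block with Python's standard `xml.etree.ElementTree` and groups the index terms by scheme and language. The function name `keywords_by_scheme` and the `FRAGMENT` constant are assumptions introduced for this example. Note that the full record as shown uses `wicri:` and `inist:` prefixes with no visible `xmlns` declarations, so a namespace-aware parser will likely reject it unless those prefixes are bound first; the fragment used here avoids them.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Copy of the <textClass> block from the record above.
FRAGMENT = """<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Error correction</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Validation</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Réseau web</term>
<term>Correction erreur</term>
<term>Reconnaissance optique caractère</term>
<term>Validation</term>
<term>Recherche information</term>
</keywords>
</textClass>"""


def keywords_by_scheme(xml_text):
    """Map (scheme, language) to the list of terms in each <keywords> element."""
    root = ET.fromstring(xml_text)
    result = {}
    for keywords in root.iter("keywords"):
        key = (keywords.get("scheme"), keywords.get(XML_LANG))
        result[key] = [term.text for term in keywords.findall("term")]
    return result


if __name__ == "__main__":
    for (scheme, lang), terms in keywords_by_scheme(FRAGMENT).items():
        print(f"{scheme} [{lang}]: {'; '.join(terms)}")
```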

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000369 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Checkpoint/biblio.hfd -nk 000369 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Pascal:05-0360301
   |texte=   Using the Web to validate document recognition results : Experiments with business cards
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024