OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The effects of OCR error on the extraction of private information

Internal identifier: 000483 (PascalFrancis/Curation); previous: 000482; next: 000484

Authors: Kazem Taghva [United States]; Russell Beckley [United States]; Jeffrey Coombs [United States]

Source :

RBID : Pascal:08-0029057

French descriptors

English descriptors

Abstract

OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies, however, have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.
pA  
A01 01  1    @0 0302-9743
A05       @2 3872
A08 01  1  ENG  @1 The effects of OCR error on the extraction of private information
A09 01  1  ENG  @1 Document analysis systems VII : 7th international workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006 : proceedings
A11 01  1    @1 TAGHVA (Kazem)
A11 02  1    @1 BECKLEY (Russell)
A11 03  1    @1 COOMBS (Jeffrey)
A12 01  1    @1 BUNKE (Horst) @9 ed.
A12 02  1    @1 SPITZ (A. Lawrence) @9 ed.
A14 01      @1 Information Science Research Institute, University of Nevada @2 Las Vegas @3 USA @Z 1 aut. @Z 2 aut. @Z 3 aut.
A20       @1 348-357
A21       @1 2006
A23 01      @0 ENG
A26 01      @0 3-540-32140-3
A43 01      @1 INIST @2 16343 @5 354000153630480310
A44       @0 0000 @1 © 2008 INIST-CNRS. All rights reserved.
A45       @0 18 ref.
A47 01  1    @0 08-0029057
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 Lecture notes in computer science
A66 01      @0 DEU
C01 01    ENG  @0 OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies however have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually-corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.
C02 01  X    @0 001D02B07C
C02 02  X    @0 001D02C04
C02 03  X    @0 001D02C03
C03 01  X  FRE  @0 Traitement image @5 01
C03 01  X  ENG  @0 Image processing @5 01
C03 01  X  SPA  @0 Procesamiento imagen @5 01
C03 02  X  FRE  @0 Reconnaissance forme @5 02
C03 02  X  ENG  @0 Pattern recognition @5 02
C03 02  X  SPA  @0 Reconocimiento patrón @5 02
C03 03  X  FRE  @0 Analyse documentaire @5 03
C03 03  X  ENG  @0 Document analysis @5 03
C03 03  X  SPA  @0 Análisis documental @5 03
C03 04  X  FRE  @0 Structure document @5 04
C03 04  X  ENG  @0 Document structure @5 04
C03 04  X  SPA  @0 Estructura documental @5 04
C03 05  X  FRE  @0 Reconnaissance caractère @5 06
C03 05  X  ENG  @0 Character recognition @5 06
C03 05  X  SPA  @0 Reconocimiento carácter @5 06
C03 06  X  FRE  @0 Reconnaissance optique caractère @5 07
C03 06  X  ENG  @0 Optical character recognition @5 07
C03 06  X  SPA  @0 Reconocimiento óptico de caracteres @5 07
C03 07  X  FRE  @0 Extraction information @5 08
C03 07  X  ENG  @0 Information extraction @5 08
C03 07  X  SPA  @0 Extracción información @5 08
C03 08  X  FRE  @0 Vie privée @5 09
C03 08  X  ENG  @0 Private life @5 09
C03 08  X  SPA  @0 Vida privada @5 09
C03 09  X  FRE  @0 Linguistique @5 10
C03 09  X  ENG  @0 Linguistics @5 10
C03 09  X  SPA  @0 Lingüística @5 10
C03 10  X  FRE  @0 Texte @5 11
C03 10  X  ENG  @0 Text @5 11
C03 10  X  SPA  @0 Texto @5 11
C03 11  X  FRE  @0 Classification @5 12
C03 11  X  ENG  @0 Classification @5 12
C03 11  X  SPA  @0 Clasificación @5 12
N21       @1 052
N44 01      @1 OTO
N82       @1 OTO
pR  
A30 01  1  ENG  @1 DAS 2006 @2 7 @3 Nelson NZL @4 2006
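
The flat listing above can be read mechanically: each line starts with a field tag, followed by optional indicators, followed by subfields introduced by `@` and a one-character code. The following is a minimal Python sketch of such a reader. The function name `parse_field`, the regular expression, and the reading of `A11` as the author field are assumptions inferred from this record alone, not part of Dilib or of any official Pascal specification.

```python
import re

# One subfield: "@", a one-character code, then text up to the next subfield
# marker or the end of the line. (Pattern inferred from the listing above.)
SUBFIELD = re.compile(r"@(\w)\s+(.*?)(?=\s+@\w\s|\s*$)")


def parse_field(line):
    """Split one flat line into (tag, indicators, [(code, text), ...]).

    Returns None for lines without subfields, such as the "pA" / "pR" markers.
    """
    head, sep, rest = line.strip().partition("@")
    parts = head.split()
    if not sep or not parts:
        return None
    tag, indicators = parts[0], parts[1:]
    return tag, indicators, SUBFIELD.findall("@" + rest)


def authors(lines):
    """Collect subfield 1 of every A11 line (assumed here to hold author names)."""
    found = []
    for line in lines:
        field = parse_field(line)
        if field and field[0] == "A11":
            found.extend(text for code, text in field[2] if code == "1")
    return found


sample = [
    "pA  ",
    "A11 01  1    @1 TAGHVA (Kazem)",
    "A11 02  1    @1 BECKLEY (Russell)",
    "C03 01  X  FRE  @0 Traitement image @5 01",
]
print(authors(sample))
```

A regular expression is used rather than a plain split on `@` so that repeated subfield codes, such as the three `@Z` entries on the `A14` line, are kept as separate pairs.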

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">The effects of OCR error on the extraction of private information</title>
<author>
<name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Beckley, Russell" sort="Beckley, Russell" uniqKey="Beckley R" first="Russell" last="Beckley">Russell Beckley</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Coombs, Jeffrey" sort="Coombs, Jeffrey" uniqKey="Coombs J" first="Jeffrey" last="Coombs">Jeffrey Coombs</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">08-0029057</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 08-0029057 INIST</idno>
<idno type="RBID">Pascal:08-0029057</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000301</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000483</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">The effects of OCR error on the extraction of private information</title>
<author>
<name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Beckley, Russell" sort="Beckley, Russell" uniqKey="Beckley R" first="Russell" last="Beckley">Russell Beckley</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author>
<name sortKey="Coombs, Jeffrey" sort="Coombs, Jeffrey" uniqKey="Coombs J" first="Jeffrey" last="Coombs">Jeffrey Coombs</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
<imprint>
<date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Classification</term>
<term>Document analysis</term>
<term>Document structure</term>
<term>Image processing</term>
<term>Information extraction</term>
<term>Linguistics</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Private life</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Traitement image</term>
<term>Reconnaissance forme</term>
<term>Analyse documentaire</term>
<term>Structure document</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Extraction information</term>
<term>Vie privée</term>
<term>Linguistique</term>
<term>Texte</term>
<term>Classification</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Linguistique</term>
<term>Classification</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies however have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually-corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0302-9743</s0>
</fA01>
<fA05>
<s2>3872</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG">
<s1>The effects of OCR error on the extraction of private information</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Document analysis systems VII : 7th international workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006 : proceedings</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>TAGHVA (Kazem)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>BECKLEY (Russell)</s1>
</fA11>
<fA11 i1="03" i2="1">
<s1>COOMBS (Jeffrey)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>BUNKE (Horst)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>SPITZ (A. Lawrence)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA20>
<s1>348-357</s1>
</fA20>
<fA21>
<s1>2006</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA26 i1="01">
<s0>3-540-32140-3</s0>
</fA26>
<fA43 i1="01">
<s1>INIST</s1>
<s2>16343</s2>
<s5>354000153630480310</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2008 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>18 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>08-0029057</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Lecture notes in computer science</s0>
</fA64>
<fA66 i1="01">
<s0>DEU</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies however have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually-corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001D02B07C</s0>
</fC02>
<fC02 i1="02" i2="X">
<s0>001D02C04</s0>
</fC02>
<fC02 i1="03" i2="X">
<s0>001D02C03</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Traitement image</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Image processing</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Procesamiento imagen</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Reconnaissance forme</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Pattern recognition</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Reconocimiento patrón</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Analyse documentaire</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Document analysis</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Análisis documental</s0>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Structure document</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Document structure</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Estructura documental</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Reconnaissance caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Reconocimiento carácter</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>07</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA">
<s0>Reconocimiento óptico de caracteres</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE">
<s0>Extraction information</s0>
<s5>08</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Information extraction</s0>
<s5>08</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA">
<s0>Extracción información</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE">
<s0>Vie privée</s0>
<s5>09</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG">
<s0>Private life</s0>
<s5>09</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA">
<s0>Vida privada</s0>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE">
<s0>Linguistique</s0>
<s5>10</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG">
<s0>Linguistics</s0>
<s5>10</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA">
<s0>Lingüística</s0>
<s5>10</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE">
<s0>Texte</s0>
<s5>11</s5>
</fC03>
<fC03 i1="10" i2="X" l="ENG">
<s0>Text</s0>
<s5>11</s5>
</fC03>
<fC03 i1="10" i2="X" l="SPA">
<s0>Texto</s0>
<s5>11</s5>
</fC03>
<fC03 i1="11" i2="X" l="FRE">
<s0>Classification</s0>
<s5>12</s5>
</fC03>
<fC03 i1="11" i2="X" l="ENG">
<s0>Classification</s0>
<s5>12</s5>
</fC03>
<fC03 i1="11" i2="X" l="SPA">
<s0>Clasificación</s0>
<s5>12</s5>
</fC03>
<fN21>
<s1>052</s1>
</fN21>
<fN44 i1="01">
<s1>OTO</s1>
</fN44>
<fN82>
<s1>OTO</s1>
</fN82>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>DAS 2006</s1>
<s2>7</s2>
<s3>Nelson NZL</s3>
<s4>2006</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
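
For readers without Dilib, the `<inist>` part of the record can also be read with a general-purpose XML library. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened fragment copied from the record above. The helper names `authors` and `descriptors` are hypothetical. Note that the TEI part, as displayed here, uses prefixes such as `wicri:` and `inist:` without visible namespace declarations, so a namespace-aware parser may reject the full record. For that reason the sketch is limited to a prefix-free `<pA>` fragment.

```python
import xml.etree.ElementTree as ET

# Shortened fragment of the <pA> block shown above (not the full record).
FRAGMENT = """
<pA>
<fA11 i1="01" i2="1">
<s1>TAGHVA (Kazem)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>BECKLEY (Russell)</s1>
</fA11>
<fC03 i1="07" i2="X" l="FRE">
<s0>Extraction information</s0>
<s5>08</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Information extraction</s0>
<s5>08</s5>
</fC03>
</pA>
"""


def authors(root):
    """Return the text of <s1> for every <fA11> field (author names in this record)."""
    return [field.findtext("s1") for field in root.iter("fA11")]


def descriptors(root, lang):
    """Return the <s0> text of every <fC03> field whose l attribute equals lang."""
    return [f.findtext("s0") for f in root.iter("fC03") if f.get("l") == lang]


root = ET.fromstring(FRAGMENT.strip())
print(authors(root))
print(descriptors(root, "ENG"))
```

Filtering on the `l` attribute mirrors how the record repeats each descriptor once per language (FRE, ENG, SPA) under the same `i1` index.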

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000483 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000483 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Curation
   |type=    RBID
   |clé=     Pascal:08-0029057
   |texte=   The effects of OCR error on the extraction of private information
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024