The effects of OCR error on the extraction of private information
Internal identifier: 000483 (PascalFrancis/Curation); previous: 000482; next: 000484
Authors: Kazem Taghva [United States]; Russell Beckley [United States]; Jeffrey Coombs [United States]
Source: Lecture notes in computer science [ISSN 0302-9743]; 2006.
RBID : Pascal:08-0029057
French descriptors
- Pascal (Inist)
- Traitement image,
Reconnaissance forme,
Analyse documentaire,
Structure document,
Reconnaissance caractère,
Reconnaissance optique caractère,
Extraction information,
Vie privée,
Linguistique,
Texte,
Classification.
- Wicri :
English descriptors
- KwdEn :
- Character recognition,
Classification,
Document analysis,
Document structure,
Image processing,
Information extraction,
Linguistics,
Optical character recognition,
Pattern recognition,
Private life,
Text.
Abstract
OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies, however, have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.
pA

| Field | i1 | i2 | l | Subfields |
|---|---|---|---|---|
| A01 | 01 | 1 | | @0 0302-9743 |
| A05 | | | | @2 3872 |
| A08 | 01 | 1 | ENG | @1 The effects of OCR error on the extraction of private information |
| A09 | 01 | 1 | ENG | @1 Document analysis systems VII : 7th international workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006 : proceedings |
| A11 | 01 | 1 | | @1 TAGHVA (Kazem) |
| A11 | 02 | 1 | | @1 BECKLEY (Russell) |
| A11 | 03 | 1 | | @1 COOMBS (Jeffrey) |
| A12 | 01 | 1 | | @1 BUNKE (Horst) @9 ed. |
| A12 | 02 | 1 | | @1 SPITZ (A. Lawrence) @9 ed. |
| A14 | 01 | | | @1 Information Science Research Institute, University of Nevada @2 Las Vegas @3 USA @Z 1 aut. @Z 2 aut. @Z 3 aut. |
| A20 | | | | @1 348-357 |
| A21 | | | | @1 2006 |
| A23 | 01 | | | @0 ENG |
| A26 | 01 | | | @0 3-540-32140-3 |
| A43 | 01 | | | @1 INIST @2 16343 @5 354000153630480310 |
| A44 | | | | @0 0000 @1 © 2008 INIST-CNRS. All rights reserved. |
| A45 | | | | @0 18 ref. |
| A47 | 01 | 1 | | @0 08-0029057 |
| A60 | | | | @1 P @2 C |
| A61 | | | | @0 A |
| A64 | 01 | 1 | | @0 Lecture notes in computer science |
| A66 | 01 | | | @0 DEU |
| C01 | 01 | | ENG | @0 OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies however have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually-corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors. |
| C02 | 01 | X | | @0 001D02B07C |
| C02 | 02 | X | | @0 001D02C04 |
| C02 | 03 | X | | @0 001D02C03 |
| C03 | 01 | X | FRE | @0 Traitement image @5 01 |
| C03 | 01 | X | ENG | @0 Image processing @5 01 |
| C03 | 01 | X | SPA | @0 Procesamiento imagen @5 01 |
| C03 | 02 | X | FRE | @0 Reconnaissance forme @5 02 |
| C03 | 02 | X | ENG | @0 Pattern recognition @5 02 |
| C03 | 02 | X | SPA | @0 Reconocimiento patrón @5 02 |
| C03 | 03 | X | FRE | @0 Analyse documentaire @5 03 |
| C03 | 03 | X | ENG | @0 Document analysis @5 03 |
| C03 | 03 | X | SPA | @0 Análisis documental @5 03 |
| C03 | 04 | X | FRE | @0 Structure document @5 04 |
| C03 | 04 | X | ENG | @0 Document structure @5 04 |
| C03 | 04 | X | SPA | @0 Estructura documental @5 04 |
| C03 | 05 | X | FRE | @0 Reconnaissance caractère @5 06 |
| C03 | 05 | X | ENG | @0 Character recognition @5 06 |
| C03 | 05 | X | SPA | @0 Reconocimiento carácter @5 06 |
| C03 | 06 | X | FRE | @0 Reconnaissance optique caractère @5 07 |
| C03 | 06 | X | ENG | @0 Optical character recognition @5 07 |
| C03 | 06 | X | SPA | @0 Reconocimento óptico de caracteres @5 07 |
| C03 | 07 | X | FRE | @0 Extraction information @5 08 |
| C03 | 07 | X | ENG | @0 Information extraction @5 08 |
| C03 | 07 | X | SPA | @0 Extracción información @5 08 |
| C03 | 08 | X | FRE | @0 Vie privée @5 09 |
| C03 | 08 | X | ENG | @0 Private life @5 09 |
| C03 | 08 | X | SPA | @0 Vida privada @5 09 |
| C03 | 09 | X | FRE | @0 Linguistique @5 10 |
| C03 | 09 | X | ENG | @0 Linguistics @5 10 |
| C03 | 09 | X | SPA | @0 Linguística @5 10 |
| C03 | 10 | X | FRE | @0 Texte @5 11 |
| C03 | 10 | X | ENG | @0 Text @5 11 |
| C03 | 10 | X | SPA | @0 Texto @5 11 |
| C03 | 11 | X | FRE | @0 Classification @5 12 |
| C03 | 11 | X | ENG | @0 Classification @5 12 |
| C03 | 11 | X | SPA | @0 Clasificación @5 12 |
| N21 | | | | @1 052 |
| N44 | 01 | | | @1 OTO |
| N82 | | | | @1 OTO |

pR

| Field | i1 | i2 | l | Subfields |
|---|---|---|---|---|
| A30 | 01 | 1 | ENG | @1 DAS 2006 @2 7 @3 Nelson NZL @4 2006 |
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000301
Links to Exploration step
Pascal:08-0029057
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">The effects of OCR error on the extraction of private information</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Beckley, Russell" sort="Beckley, Russell" uniqKey="Beckley R" first="Russell" last="Beckley">Russell Beckley</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Coombs, Jeffrey" sort="Coombs, Jeffrey" uniqKey="Coombs J" first="Jeffrey" last="Coombs">Jeffrey Coombs</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">08-0029057</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 08-0029057 INIST</idno>
<idno type="RBID">Pascal:08-0029057</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000301</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000483</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">The effects of OCR error on the extraction of private information</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Beckley, Russell" sort="Beckley, Russell" uniqKey="Beckley R" first="Russell" last="Beckley">Russell Beckley</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
<author><name sortKey="Coombs, Jeffrey" sort="Coombs, Jeffrey" uniqKey="Coombs J" first="Jeffrey" last="Coombs">Jeffrey Coombs</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
<imprint><date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Lecture notes in computer science</title>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Classification</term>
<term>Document analysis</term>
<term>Document structure</term>
<term>Image processing</term>
<term>Information extraction</term>
<term>Linguistics</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Private life</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Traitement image</term>
<term>Reconnaissance forme</term>
<term>Analyse documentaire</term>
<term>Structure document</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Extraction information</term>
<term>Vie privée</term>
<term>Linguistique</term>
<term>Texte</term>
<term>Classification</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Linguistique</term>
<term>Classification</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies however have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually-corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>0302-9743</s0>
</fA01>
<fA05><s2>3872</s2>
</fA05>
<fA08 i1="01" i2="1" l="ENG"><s1>The effects of OCR error on the extraction of private information</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>Document analysis systems VII : 7th international workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006 : proceedings</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>TAGHVA (Kazem)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>BECKLEY (Russell)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>COOMBS (Jeffrey)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>BUNKE (Horst)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>SPITZ (A. Lawrence)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA20><s1>348-357</s1>
</fA20>
<fA21><s1>2006</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA26 i1="01"><s0>3-540-32140-3</s0>
</fA26>
<fA43 i1="01"><s1>INIST</s1>
<s2>16343</s2>
<s5>354000153630480310</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2008 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>18 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>08-0029057</s0>
</fA47>
<fA60><s1>P</s1>
<s2>C</s2>
</fA60>
<fA64 i1="01" i2="1"><s0>Lecture notes in computer science</s0>
</fA64>
<fA66 i1="01"><s0>DEU</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies however have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections, one with OCR-ed documents and another with manually-corrected versions of the former. We discovered a significant reduction in accuracy on the OCR text versus the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001D02B07C</s0>
</fC02>
<fC02 i1="02" i2="X"><s0>001D02C04</s0>
</fC02>
<fC02 i1="03" i2="X"><s0>001D02C03</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Traitement image</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Image processing</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Procesamiento imagen</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Reconnaissance forme</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Pattern recognition</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Reconocimiento patrón</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Analyse documentaire</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Document analysis</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Análisis documental</s0>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Structure document</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Document structure</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Estructura documental</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>07</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Extraction information</s0>
<s5>08</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Information extraction</s0>
<s5>08</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Extracción información</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE"><s0>Vie privée</s0>
<s5>09</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Private life</s0>
<s5>09</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA"><s0>Vida privada</s0>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="X" l="FRE"><s0>Linguistique</s0>
<s5>10</s5>
</fC03>
<fC03 i1="09" i2="X" l="ENG"><s0>Linguistics</s0>
<s5>10</s5>
</fC03>
<fC03 i1="09" i2="X" l="SPA"><s0>Linguística</s0>
<s5>10</s5>
</fC03>
<fC03 i1="10" i2="X" l="FRE"><s0>Texte</s0>
<s5>11</s5>
</fC03>
<fC03 i1="10" i2="X" l="ENG"><s0>Text</s0>
<s5>11</s5>
</fC03>
<fC03 i1="10" i2="X" l="SPA"><s0>Texto</s0>
<s5>11</s5>
</fC03>
<fC03 i1="11" i2="X" l="FRE"><s0>Classification</s0>
<s5>12</s5>
</fC03>
<fC03 i1="11" i2="X" l="ENG"><s0>Classification</s0>
<s5>12</s5>
</fC03>
<fC03 i1="11" i2="X" l="SPA"><s0>Clasificación</s0>
<s5>12</s5>
</fC03>
<fN21><s1>052</s1>
</fN21>
<fN44 i1="01"><s1>OTO</s1>
</fN44>
<fN82><s1>OTO</s1>
</fN82>
</pA>
<pR><fA30 i1="01" i2="1" l="ENG"><s1>DAS 2006</s1>
<s2>7</s2>
<s3>Nelson NZL</s3>
<s4>2006</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000483 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000483 | SxmlIndent | more
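Once a record has been selected this way, its fields can be pulled out with a short script. The following is a minimal Python sketch, not part of Dilib: the function name `english_keywords` and the sample string are hypothetical. It uses regular expressions instead of an XML parser because the record as displayed uses prefixes such as `wicri:` and `inist:` without visible namespace declarations, which a strict parser may reject.

```python
import re
from typing import List


def english_keywords(record_xml: str) -> List[str]:
    """Return the <term> values found inside the KwdEn <keywords> block."""
    # Locate the first keywords block whose scheme is "KwdEn".
    block = re.search(
        r'<keywords scheme="KwdEn"[^>]*>(.*?)</keywords>',
        record_xml,
        re.DOTALL,
    )
    if block is None:
        return []
    # Collect every <term>...</term> inside that block only.
    terms = re.findall(r"<term>(.*?)</term>", block.group(1), re.DOTALL)
    return [term.strip() for term in terms]


# Hypothetical excerpt shaped like the record shown on this page.
sample = (
    '<keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>\n'
    "<term>Classification</term>\n"
    "</keywords>\n"
    '<keywords scheme="Pascal" xml:lang="fr"><term>Traitement image</term>\n'
    "</keywords>"
)

print(english_keywords(sample))
```

To apply this to a real record, the text passed to `english_keywords` could be the output of the `HfdSelect` command above, read from standard input with `sys.stdin.read()`.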
To link to this page in the Wicri network
{{Explor lien
|wiki= Ticri/CIDE
|area= OcrV1
|flux= PascalFrancis
|étape= Curation
|type= RBID
|clé= Pascal:08-0029057
|texte= The effects of OCR error on the extraction of private information
}}
This area was generated with Dilib version V0.6.32. Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024