The effect of OCR errors on stylistic text classification

Internal identifier: 000361 (PascalFrancis/Corpus); previous: 000360; next: 000362

Authors: Sterling Stuart Stein; Shlomo Argamon; Ophir Frieder

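Source: SIGIR 2006: proceedings of the Twenty-Ninth annual international ACM SIGIR Conference on research and development in information retrieval, August 6-11, 2006, Seattle, WA, USA; New York, NY: ACM Press; 2006; pp. 701-702; ISBN 1-59593-369-7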

RBID : Pascal:06-0519586

French descriptors

Pascal (Inist)
- Reconnaissance caractère, Reconnaissance optique caractère, Analyse contenu, Recherche information, Texte, Classification, Analyse texte.

English descriptors

KwdEn :
- Character recognition, Classification, Content analysis, Information retrieval, Optical character recognition, Text, Text analysis.

Abstract

Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.

Record in standard format (ISO 2709)

For details, see the documentation on the Inist Standard format.

A08	`01`	`1`	`ENG`	`@1 The effect of OCR errors on stylistic text classification`
A09	`01`	`1`	`ENG`	`@1 SIGIR 2006 : proceedings of the Twenty-Ninth annual international ACM SIGIR Conference on research and development in information retrieval, August 6-11, 2006, Seattle, WA, USA`
A11	`01`	`1`		`@1 STEIN (Sterling Stuart)`
A11	`02`	`1`		`@1 ARGAMON (Shlomo)`
A11	`03`	`1`		`@1 FRIEDER (Ophir)`
A14	`01`			`@1 Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street @2 Chicago, IL 60616-3793 @3 USA @Z 1 aut. @Z 2 aut.`
A14	`02`			`@1 Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street @2 Chicago, IL 60616-3793 @3 USA @Z 3 aut.`
A18	`01`	`1`		`@1 Association for computing machinery @3 USA @9 org-cong.`
A20				`@1 701-702`
A21				`@1 2006`
A23	`01`			`@0 ENG`
A25	`01`			`@1 ACM Press @2 New York NY`
A26	`01`			`@0 1-59593-369-7`
A30	`01`	`1`	`ENG`	`@1 International ACM SIGIR conference on research and development in information retrieval @2 29 @3 Seattle WA USA @4 2006`
A43	`01`			`@1 INIST @2 Y 38973 @5 354000153508051280`
A44				`@0 0000 @1 © 2006 INIST-CNRS. All rights reserved.`
A45				`@0 5 ref.`
A47	`01`	`1`		`@0 06-0519586`
A60				`@1 C`
A61				`@0 A`
A66	`01`			`@0 USA`
C01	`01`		`ENG`	`@0 Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.`
C02	`01`	`X`		`@0 001D02C04`
C02	`02`	`X`		`@0 001D02B07D`
C03	`01`	`X`	`FRE`	`@0 Reconnaissance caractère @5 06`
C03	`01`	`X`	`ENG`	`@0 Character recognition @5 06`
C03	`01`	`X`	`SPA`	`@0 Reconocimiento carácter @5 06`
C03	`02`	`X`	`FRE`	`@0 Reconnaissance optique caractère @5 07`
C03	`02`	`X`	`ENG`	`@0 Optical character recognition @5 07`
C03	`02`	`X`	`SPA`	`@0 Reconocimiento óptico de caracteres @5 07`
C03	`03`	`X`	`FRE`	`@0 Analyse contenu @5 08`
C03	`03`	`X`	`ENG`	`@0 Content analysis @5 08`
C03	`03`	`X`	`SPA`	`@0 Análisis contenido @5 08`
C03	`04`	`X`	`FRE`	`@0 Recherche information @5 09`
C03	`04`	`X`	`ENG`	`@0 Information retrieval @5 09`
C03	`04`	`X`	`SPA`	`@0 Búsqueda información @5 09`
C03	`05`	`X`	`FRE`	`@0 Texte @5 10`
C03	`05`	`X`	`ENG`	`@0 Text @5 10`
C03	`05`	`X`	`SPA`	`@0 Texto @5 10`
C03	`06`	`X`	`FRE`	`@0 Classification @5 11`
C03	`06`	`X`	`ENG`	`@0 Classification @5 11`
C03	`06`	`X`	`SPA`	`@0 Clasificación @5 11`
C03	`07`	`3`	`FRE`	`@0 Analyse texte @5 18`
C03	`07`	`3`	`ENG`	`@0 Text analysis @5 18`
N21				`@1 338`
N44	`01`			`@1 OTO`
N82				`@1 OTO`
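
The field listing above can be read mechanically. The following is a minimal sketch in Python, assuming each line is tab-separated into a field tag, up to three short columns (which look like two indicators and a language code), and a final payload made of `@x value` subfields. That layout is inferred from the listing itself, not taken from the Inist documentation, and `parse_field_line` is a hypothetical helper name.

```python
import re

# Assumption: a subfield starts with "@" plus one code character, and its
# value runs until the next " @x " marker or the end of the payload.
SUBFIELD = re.compile(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)")


def parse_field_line(line):
    """Split one tab-separated field line into tag, columns, and subfields."""
    # Cells are wrapped in backticks in the listing; strip them.
    cells = [cell.strip("`") for cell in line.split("\t")]
    tag, payload = cells[0], cells[-1]
    columns = cells[1:-1]  # middle columns; empty strings when absent
    subfields = SUBFIELD.findall(payload)  # list of (code, value) pairs
    return {"tag": tag, "columns": columns, "subfields": subfields}


# Example line copied from the listing (publisher field).
sample = "A25\t`01`\t\t\t`@1 ACM Press @2 New York NY`"
parsed = parse_field_line(sample)
print(parsed["tag"], parsed["subfields"])
```

Subfields are kept as a list of pairs rather than a dictionary because a code can repeat within one field, as `@Z` does in the affiliation lines.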

Inist format (server)

NO :	PASCAL 06-0519586 INIST
ET :	The effect of OCR errors on stylistic text classification
AU :	STEIN (Sterling Stuart); ARGAMON (Shlomo); FRIEDER (Ophir)
AF :	Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street/Chicago, IL 60616-3793/Etats-Unis (1 aut., 2 aut.); Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street/Chicago, IL 60616-3793/Etats-Unis (3 aut.)
DT :	Congrès; Niveau analytique
SO :	International ACM SIGIR conference on research and development in information retrieval/29/2006/Seattle WA USA; Etats-Unis; New York NY: ACM Press; Da. 2006; Pp. 701-702; ISBN 1-59593-369-7
LA :	Anglais
EA :	Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.
CC :	001D02C04; 001D02B07D
FD :	Reconnaissance caractère; Reconnaissance optique caractère; Analyse contenu; Recherche information; Texte; Classification; Analyse texte
ED :	Character recognition; Optical character recognition; Content analysis; Information retrieval; Text; Classification; Text analysis
SD :	Reconocimiento carácter; Reconocimiento óptico de caracteres; Análisis contenido; Búsqueda información; Texto; Clasificación
LO :	INIST-Y 38973.354000153508051280
ID :	06-0519586

Links to Exploration step

Pascal:06-0519586

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">The effect of OCR errors on stylistic text classification</title>
<author><name sortKey="Stein, Sterling Stuart" sort="Stein, Sterling Stuart" uniqKey="Stein S" first="Sterling Stuart" last="Stein">Sterling Stuart Stein</name>
<affiliation><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Argamon, Shlomo" sort="Argamon, Shlomo" uniqKey="Argamon S" first="Shlomo" last="Argamon">Shlomo Argamon</name>
<affiliation><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Frieder, Ophir" sort="Frieder, Ophir" uniqKey="Frieder O" first="Ophir" last="Frieder">Ophir Frieder</name>
<affiliation><inist:fA14 i1="02"><s1>Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">06-0519586</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 06-0519586 INIST</idno>
<idno type="RBID">Pascal:06-0519586</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000361</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">The effect of OCR errors on stylistic text classification</title>
<author><name sortKey="Stein, Sterling Stuart" sort="Stein, Sterling Stuart" uniqKey="Stein S" first="Sterling Stuart" last="Stein">Sterling Stuart Stein</name>
<affiliation><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Argamon, Shlomo" sort="Argamon, Shlomo" uniqKey="Argamon S" first="Shlomo" last="Argamon">Shlomo Argamon</name>
<affiliation><inist:fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Frieder, Ophir" sort="Frieder, Ophir" uniqKey="Frieder O" first="Ophir" last="Frieder">Ophir Frieder</name>
<affiliation><inist:fA14 i1="02"><s1>Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Classification</term>
<term>Content analysis</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Text</term>
<term>Text analysis</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Analyse contenu</term>
<term>Recherche information</term>
<term>Texte</term>
<term>Classification</term>
<term>Analyse texte</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA08 i1="01" i2="1" l="ENG"><s1>The effect of OCR errors on stylistic text classification</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>SIGIR 2006 : proceedings of the Twenty-Ninth annual international ACM SIGIR Conference on research and development in information retrieval, August 6-11, 2006, Seattle, WA, USA</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>STEIN (Sterling Stuart)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>ARGAMON (Shlomo)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>FRIEDER (Ophir)</s1>
</fA11>
<fA14 i1="01"><s1>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA14 i1="02"><s1>Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street</s1>
<s2>Chicago, IL 60616-3793</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1"><s1>Association for computing machinery</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20><s1>701-702</s1>
</fA20>
<fA21><s1>2006</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA25 i1="01"><s1>ACM Press</s1>
<s2>New York NY</s2>
</fA25>
<fA26 i1="01"><s0>1-59593-369-7</s0>
</fA26>
<fA30 i1="01" i2="1" l="ENG"><s1>International ACM SIGIR conference on research and development in information retrieval</s1>
<s2>29</s2>
<s3>Seattle WA USA</s3>
<s4>2006</s4>
</fA30>
<fA43 i1="01"><s1>INIST</s1>
<s2>Y 38973</s2>
<s5>354000153508051280</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2006 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>5 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>06-0519586</s0>
</fA47>
<fA60><s1>C</s1>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA66 i1="01"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>001D02C04</s0>
</fC02>
<fC02 i1="02" i2="X"><s0>001D02B07D</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Reconnaissance caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Reconocimiento carácter</s0>
<s5>06</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>07</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>07</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Reconocimiento óptico de caracteres</s0>
<s5>07</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Analyse contenu</s0>
<s5>08</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Content analysis</s0>
<s5>08</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Análisis contenido</s0>
<s5>08</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Recherche information</s0>
<s5>09</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Information retrieval</s0>
<s5>09</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Búsqueda información</s0>
<s5>09</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Texte</s0>
<s5>10</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Text</s0>
<s5>10</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Texto</s0>
<s5>10</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Classification</s0>
<s5>11</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Classification</s0>
<s5>11</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Clasificación</s0>
<s5>11</s5>
</fC03>
<fC03 i1="07" i2="3" l="FRE"><s0>Analyse texte</s0>
<s5>18</s5>
</fC03>
<fC03 i1="07" i2="3" l="ENG"><s0>Text analysis</s0>
<s5>18</s5>
</fC03>
<fN21><s1>338</s1>
</fN21>
<fN44 i1="01"><s1>OTO</s1>
</fN44>
<fN82><s1>OTO</s1>
</fN82>
</pA>
</standard>
<server><NO>PASCAL 06-0519586 INIST</NO>
<ET>The effect of OCR errors on stylistic text classification</ET>
<AU>STEIN (Sterling Stuart); ARGAMON (Shlomo); FRIEDER (Ophir)</AU>
<AF>Linguistic Cognition Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street/Chicago, IL 60616-3793/Etats-Unis (1 aut., 2 aut.); Information Retrieval Lab Computer Science Dept. Illinois Institute of Technology 3300 South Federal Street/Chicago, IL 60616-3793/Etats-Unis (3 aut.)</AF>
<DT>Congrès; Niveau analytique</DT>
<SO>International ACM SIGIR conference on research and development in information retrieval/29/2006/Seattle WA USA; Etats-Unis; New York NY: ACM Press; Da. 2006; Pp. 701-702; ISBN 1-59593-369-7</SO>
<LA>Anglais</LA>
<EA>Recently, interest is growing in non-topical text classification tasks such as genre classification, sentiment analysis, and authorship profiling. We study to what extent OCR errors affect stylistic text classification from scanned documents. We find that even a relatively high level of errors in the OCRed documents does not substantially affect stylistic classification accuracy.</EA>
<CC>001D02C04; 001D02B07D</CC>
<FD>Reconnaissance caractère; Reconnaissance optique caractère; Analyse contenu; Recherche information; Texte; Classification; Analyse texte</FD>
<ED>Character recognition; Optical character recognition; Content analysis; Information retrieval; Text; Classification; Text analysis</ED>
<SD>Reconocimiento carácter; Reconocimiento óptico de caracteres; Análisis contenido; Búsqueda información; Texto; Clasificación</SD>
<LO>INIST-Y 38973.354000153508051280</LO>
<ID>06-0519586</ID>
</server>
</inist>
</record>
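
As an illustration of how the XML record might be consumed, here is a minimal Python sketch that reads a few fields from the `<server>` block using the standard library. It deliberately parses only that block: as printed here, the TEI part uses the `inist:` prefix with no visible namespace declaration, so a namespace-aware parser may reject the full record. The excerpt is shortened, and the choice to split multi-valued fields on `"; "` is an assumption based on how the values appear above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <server> block from the record above.
server_xml = """<server><NO>PASCAL 06-0519586 INIST</NO>
<ET>The effect of OCR errors on stylistic text classification</ET>
<AU>STEIN (Sterling Stuart); ARGAMON (Shlomo); FRIEDER (Ophir)</AU>
<ED>Character recognition; Optical character recognition; Content analysis; Information retrieval; Text; Classification; Text analysis</ED>
<ID>06-0519586</ID>
</server>"""

root = ET.fromstring(server_xml)

# Map each child element's tag (NO, ET, AU, ...) to its stripped text.
fields = {child.tag: (child.text or "").strip() for child in root}

title = fields["ET"]
# Assumption: list-like fields separate their values with "; ".
authors = fields["AU"].split("; ")
keywords = fields["ED"].split("; ")

print(title)
print(authors)
print(len(keywords), "English descriptors")
```

Splitting on `"; "` suits list-like fields such as `AU`, `ED`, and `CC`; free-text fields such as `EA` (the abstract) are better left as single strings.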

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000361 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000361 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:06-0519586
   |texte=   The effect of OCR errors on stylistic text classification
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
