Effects of OCR errors on ranking and feedback using the vector space model
Internal identifier: 002682 (Main/Curation); previous: 002681; next: 002683
Authors: Kazem Taghva [United States]; Julie Borsack [United States]; Allen Condit [United States]
Source:
- Information Processing and Management [ 0306-4573 ] ; 1996.
French descriptors
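- Pascal (Inist) : Erreur, Essai, Influence, Lecture optique, Modèle espace vectoriel, Recherche documentaire, Reconnaissance caractère, Système documentaire, Texte intégral.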
- Wicri :
- topic : Essai, Recherche documentaire, Système documentaire.
English descriptors
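- KwdEn : Character recognition, Document retrieval, Document retrieval system, Error, Full text, Influence, Optical reading, Test, Vector space model.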
Abstract
We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall are not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence, though, between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it cannot be used to compensate for OCR errors caused by badly degraded documents.
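As a rough illustration of why cosine normalization can make scores diverge between an OCR text and its corrected counterpart, the sketch below scores one toy document in both forms. Everything in it is an assumption made for illustration: the tokens, the query, the raw term-frequency weights, and the function names are invented, and the weighting combinations actually tested in the paper are not reproduced here.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Build a raw term-frequency vector (term -> count)."""
    return Counter(tokens)


def dot(query_vec, doc_vec):
    """Unnormalized inner product of a query vector and a document vector."""
    return sum(weight * doc_vec.get(term, 0) for term, weight in query_vec.items())


def norm(vec):
    """Euclidean length of a term vector."""
    return math.sqrt(sum(weight * weight for weight in vec.values()))


def cosine(query_vec, doc_vec):
    """Inner product divided by both vector lengths (cosine normalization)."""
    denom = norm(query_vec) * norm(doc_vec)
    return dot(query_vec, doc_vec) / denom if denom else 0.0


# Hypothetical toy data, not taken from the paper.
query = tf_vector(["nuclear", "waste"])
corrected = tf_vector(["nuclear", "waste", "storage", "site", "storage", "site"])
# Same document after simulated OCR: one "storage" and one "site" are misread,
# so the query terms survive but the document vector changes shape.
ocr = tf_vector(["nuclear", "waste", "storage", "site", "st0rage", "s1te"])

for label, doc in (("corrected", corrected), ("ocr", ocr)):
    print(label, "inner product:", dot(query, doc), "cosine:", round(cosine(query, doc), 3))
```

In this toy case the query terms are intact in both versions, so the unnormalized inner product is the same, while the cosine score differs because the misread tokens change the length of the document vector. Whether OCR noise raises or lowers a given document's score in practice depends on the weighting scheme and on the errors involved.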
Url:
DOI: 10.1016/0306-4573(95)00058-5
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000017
- to stream Istex, to step Curation: 000017
- to stream Istex, to step Checkpoint: 001A93
- to stream Main, to step Merge: 002826
- to stream PascalFrancis, to step Corpus: 000A07
- to stream PascalFrancis, to step Curation: 000991
- to stream PascalFrancis, to step Checkpoint: 000955
- to stream Main, to step Merge: 002A42
Links to Exploration step
ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title>Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
</author>
<author><name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
</author>
<author><name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<date when="1996" year="1996">1996</date>
<idno type="doi">10.1016/0306-4573(95)00058-5</idno>
<idno type="url">https://api.istex.fr/document/2022C26E3682F8C2CDD3580811393DEEE55E8CA8/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000017</idno>
<idno type="wicri:Area/Istex/Curation">000017</idno>
<idno type="wicri:Area/Istex/Checkpoint">001A93</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002826</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:96-0295002</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000A07</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000991</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000955</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002A42</idno>
<idno type="wicri:Area/Main/Curation">002682</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a">Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Information Processing and Management</title>
<title level="j" type="abbrev">IPM</title>
<idno type="ISSN">0306-4573</idno>
<imprint><publisher>ELSEVIER</publisher>
<date type="published" when="1996">1996</date>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="317">317</biblScope>
<biblScope unit="page" to="327">327</biblScope>
</imprint>
<idno type="ISSN">0306-4573</idno>
</series>
<idno type="istex">2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<idno type="DOI">10.1016/0306-4573(95)00058-5</idno>
<idno type="PII">0306-4573(95)00058-5</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0306-4573</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Document retrieval</term>
<term>Document retrieval system</term>
<term>Error</term>
<term>Full text</term>
<term>Influence</term>
<term>Optical reading</term>
<term>Test</term>
<term>Vector space model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Erreur</term>
<term>Essai</term>
<term>Influence</term>
<term>Lecture optique</term>
<term>Modèle espace vectoriel</term>
<term>Recherche documentaire</term>
<term>Reconnaissance caractère</term>
<term>Système documentaire</term>
<term>Texte intégral</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Essai</term>
<term>Recherche documentaire</term>
<term>Système documentaire</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it can not be used to compensate for OCR errors caused by badly degraded documents.</div>
</front>
</TEI>
<double idat="0306-4573:1996:Taghva K:effects:of:ocr"><INIST><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas, Nev.</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas, Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas, Nev.</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas, Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas, Nev.</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas, Nev.</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">96-0295002</idno>
<date when="1996">1996</date>
<idno type="stanalyst">PASCAL 96-0295002 INIST</idno>
<idno type="RBID">Pascal:96-0295002</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000A07</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000991</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000955</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002A42</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, K" sort="Taghva, K" uniqKey="Taghva K" first="K." last="Taghva">K. Taghva</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas, Nev.</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas, Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, J" sort="Borsack, J" uniqKey="Borsack J" first="J." last="Borsack">J. Borsack</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas, Nev.</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas, Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Condit, A" sort="Condit, A" uniqKey="Condit A" first="A." last="Condit">A. Condit</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Information Science Research Institute, University of Nevada</s1>
<s2>Las Vegas, Nev.</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Las Vegas, Nev.</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Information processing &amp; management</title>
<title level="j" type="abbreviated">Inf. process. manage.</title>
<idno type="ISSN">0306-4573</idno>
<imprint><date when="1996">1996</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Information processing &amp; management</title>
<title level="j" type="abbreviated">Inf. process. manage.</title>
<idno type="ISSN">0306-4573</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Document retrieval</term>
<term>Document retrieval system</term>
<term>Error</term>
<term>Full text</term>
<term>Influence</term>
<term>Optical reading</term>
<term>Test</term>
<term>Vector space model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Lecture optique</term>
<term>Erreur</term>
<term>Influence</term>
<term>Système documentaire</term>
<term>Texte intégral</term>
<term>Recherche documentaire</term>
<term>Essai</term>
<term>Modèle espace vectoriel</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Système documentaire</term>
<term>Recherche documentaire</term>
<term>Essai</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it can not be used to compensate for OCR errors caused by badly degraded documents.</div>
</front>
</TEI>
</INIST>
<ISTEX><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title>Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
</author>
<author><name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
</author>
<author><name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<date when="1996" year="1996">1996</date>
<idno type="doi">10.1016/0306-4573(95)00058-5</idno>
<idno type="url">https://api.istex.fr/document/2022C26E3682F8C2CDD3580811393DEEE55E8CA8/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000017</idno>
<idno type="wicri:Area/Istex/Curation">000017</idno>
<idno type="wicri:Area/Istex/Checkpoint">001A93</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002826</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a">Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Information Processing and Management</title>
<title level="j" type="abbrev">IPM</title>
<idno type="ISSN">0306-4573</idno>
<imprint><publisher>ELSEVIER</publisher>
<date type="published" when="1996">1996</date>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="317">317</biblScope>
<biblScope unit="page" to="327">327</biblScope>
</imprint>
<idno type="ISSN">0306-4573</idno>
</series>
<idno type="istex">2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<idno type="DOI">10.1016/0306-4573(95)00058-5</idno>
<idno type="PII">0306-4573(95)00058-5</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0306-4573</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it can not be used to compensate for OCR errors caused by badly degraded documents.</div>
</front>
</TEI>
</ISTEX>
</double>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002682 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Curation/biblio.hfd -nk 002682 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Curation |type= RBID |clé= ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8 |texte= Effects of OCR errors on ranking and feedback using the vector space model }}
This area was generated with Dilib version V0.6.32.