OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Page Frame Detection for Marginal Noise Removal from Scanned Documents

Internal identifier: 000E30 (Main/Merge); previous: 000E29; next: 000E31

Authors: Faisal Shafait [Germany]; Joost Van Beusekom [Germany]; Daniel Keysers [Germany]; M. Breuel [Germany]

RBID : ISTEX:64EFFD5C6F7619CFA2A5269C1C633E03166E1D88

Abstract

We describe and evaluate a method to robustly detect the page frame in document images, locating the actual page contents area and removing textual and non-textual noise along the page borders. We use a geometric matching algorithm to find the optimal page frame, which has the advantages of not assuming the existence of whitespace between noisy borders and actual page contents, and of giving a practical solution to the page frame detection problem without the need for parameter tuning. We define suitable performance measures and evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. In addition, we demonstrate that the use of page frame detection reduces the optical character recognition (OCR) error rate by removing textual noise. Experiments using a commercial OCR system show that the error rate due to elements outside the page frame is reduced from 4.3% to 1.7% on the UW-III dataset.
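
As a quick check on the last figures, and assuming both percentages are measured on the same basis, the drop from 4.3% to 1.7% amounts to an absolute reduction of 2.6 percentage points, or a relative reduction of roughly 60%:

```latex
\[
4.3\% - 1.7\% = 2.6 \text{ percentage points},
\qquad
\frac{4.3 - 1.7}{4.3} = \frac{2.6}{4.3} \approx 0.605 \approx 60\%.
\]
```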

DOI: 10.1007/978-3-540-73040-8_66

Links to Exploration step

ISTEX:64EFFD5C6F7619CFA2A5269C1C633E03166E1D88

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Page Frame Detection for Marginal Noise Removal from Scanned Documents</title>
<author>
<name sortKey="Shafait, Faisal" sort="Shafait, Faisal" uniqKey="Shafait F" first="Faisal" last="Shafait">Faisal Shafait</name>
</author>
<author>
<name sortKey="Van Beusekom, Joost" sort="Van Beusekom, Joost" uniqKey="Van Beusekom J" first="Joost" last="Van Beusekom">Joost Van Beusekom</name>
</author>
<author>
<name sortKey="Keysers, Daniel" sort="Keysers, Daniel" uniqKey="Keysers D" first="Daniel" last="Keysers">Daniel Keysers</name>
</author>
<author>
<name sortKey="Breuel, M" sort="Breuel, M" uniqKey="Breuel M" first="M." last="Breuel">M. Breuel</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:64EFFD5C6F7619CFA2A5269C1C633E03166E1D88</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-73040-8_66</idno>
<idno type="url">https://api.istex.fr/document/64EFFD5C6F7619CFA2A5269C1C633E03166E1D88/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000232</idno>
<idno type="wicri:Area/Istex/Curation">000229</idno>
<idno type="wicri:Area/Istex/Checkpoint">000832</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Shafait F:page:frame:detection</idno>
<idno type="wicri:Area/Main/Merge">000E30</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Page Frame Detection for Marginal Noise Removal from Scanned Documents</title>
<author>
<name sortKey="Shafait, Faisal" sort="Shafait, Faisal" uniqKey="Shafait F" first="Faisal" last="Shafait">Faisal Shafait</name>
<affiliation wicri:level="3">
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Image Understanding and Pattern Recognition (IUPR) research group, German Research Center for Artificial Intelligence (DFKI) GmbH, D-67663 Kaiserslautern</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Van Beusekom, Joost" sort="Van Beusekom, Joost" uniqKey="Van Beusekom J" first="Joost" last="Van Beusekom">Joost Van Beusekom</name>
<affiliation wicri:level="4">
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Department of Computer Science, Technical University of Kaiserslautern, D-67663 Kaiserslautern</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
<orgName type="university">Université de technologie de Kaiserslautern</orgName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Keysers, Daniel" sort="Keysers, Daniel" uniqKey="Keysers D" first="Daniel" last="Keysers">Daniel Keysers</name>
<affiliation wicri:level="3">
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Image Understanding and Pattern Recognition (IUPR) research group, German Research Center for Artificial Intelligence (DFKI) GmbH, D-67663 Kaiserslautern</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Breuel, M" sort="Breuel, M" uniqKey="Breuel M" first="M." last="Breuel">M. Breuel</name>
<affiliation wicri:level="4">
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Department of Computer Science, Technical University of Kaiserslautern, D-67663 Kaiserslautern</wicri:regionArea>
<placeName>
<region type="land" nuts="2">Rhénanie-Palatinat</region>
<settlement type="city">Kaiserslautern</settlement>
</placeName>
<orgName type="university">Université de technologie de Kaiserslautern</orgName>
</affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Allemagne</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">64EFFD5C6F7619CFA2A5269C1C633E03166E1D88</idno>
<idno type="DOI">10.1007/978-3-540-73040-8_66</idno>
<idno type="ChapterID">66</idno>
<idno type="ChapterID">Chap66</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: We describe and evaluate a method to robustly detect the page frame in document images, locating the actual page contents area and removing textual and non-textual noise along the page borders. We use a geometric matching algorithm to find the optimal page frame, which has the advantages of not assuming the existence of whitespace between noisy borders and actual page contents, and of giving a practical solution to the page frame detection problem without the need for parameter tuning. We define suitable performance measures and evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. In addition, we demonstrate that the use of page frame detection reduces the optical character recognition (OCR) error rate by removing textual noise. Experiments using a commercial OCR system show that the error rate due to elements outside the page frame is reduced from 4.3% to 1.7% on the UW-III dataset.</div>
</front>
</TEI>
</record>
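
The record above can be read programmatically, with one caveat: as shown here it uses the `wicri:` prefix without declaring it, so a strict XML parser rejects it. The following is a minimal Python sketch that works around this by injecting a placeholder namespace declaration on the root element. The namespace URI, the function names `parse_record` and `summarize`, and the trimmed `SAMPLE` string are assumptions made for illustration; they are not part of Dilib or of the Wicri tooling.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption): the record never declares the
# wicri prefix, so this value exists only to satisfy the parser.
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text: str) -> ET.Element:
    """Parse a <record> whose wicri: prefix is undeclared."""
    patched = xml_text.replace(
        "<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1
    )
    return ET.fromstring(patched)


def summarize(record: ET.Element) -> dict:
    """Collect the title, DOI and author names from the TEI header."""
    title_stmt = record.find(".//titleStmt")
    pub_stmt = record.find(".//publicationStmt")
    return {
        "title": title_stmt.findtext("title"),
        "doi": pub_stmt.findtext("idno[@type='doi']"),
        "authors": [n.text for n in title_stmt.findall("author/name")],
    }


# Trimmed copy of the record above, kept short for the example.
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">Page Frame Detection for Marginal Noise Removal from Scanned Documents</title>
<author><name sortKey="Shafait, Faisal">Faisal Shafait</name></author>
<author><name sortKey="Keysers, Daniel">Daniel Keysers</name></author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="doi">10.1007/978-3-540-73040-8_66</idno>
</publicationStmt>
</fileDesc></teiHeader>
</TEI>
</record>"""

info = summarize(parse_record(SAMPLE))
print(info["title"])
print(info["doi"])
print("; ".join(info["authors"]))
```

In practice the full output of `HfdSelect` could be passed to `parse_record` in place of `SAMPLE`, provided it starts with a bare `<record>` tag as it does on this page.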

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000E30 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000E30 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     ISTEX:64EFFD5C6F7619CFA2A5269C1C633E03166E1D88
   |texte=   Page Frame Detection for Marginal Noise Removal from Scanned Documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024