Exploration server on OCR

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Transforming paper documents into XML format with WISDOM++

Internal identifier: 001C76 (Main/Merge); previous: 001C75; next: 001C77

Authors: Oronzo Altamura [Italy]; Floriana Esposito [Italy]; Donato Malerba [Italy]

RBID : Pascal:02-0009551

Abstract

The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and text transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.

Links to Exploration step

Pascal:02-0009551

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Transforming paper documents into XML format with WISDOM++</title>
<author>
<name sortKey="Altamura, Oronzo" sort="Altamura, Oronzo" uniqKey="Altamura O" first="Oronzo" last="Altamura">Oronzo Altamura</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Esposito, Floriana" sort="Esposito, Floriana" uniqKey="Esposito F" first="Floriana" last="Esposito">Floriana Esposito</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Malerba, Donato" sort="Malerba, Donato" uniqKey="Malerba D" first="Donato" last="Malerba">Donato Malerba</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">02-0009551</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 02-0009551 INIST</idno>
<idno type="RBID">Pascal:02-0009551</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000698</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000094</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000649</idno>
<idno type="wicri:doubleKey">1433-2833:2001:Altamura O:transforming:paper:documents</idno>
<idno type="wicri:Area/Main/Merge">001C76</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Transforming paper documents into XML format with WISDOM++</title>
<author>
<name sortKey="Altamura, Oronzo" sort="Altamura, Oronzo" uniqKey="Altamura O" first="Oronzo" last="Altamura">Oronzo Altamura</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Esposito, Floriana" sort="Esposito, Floriana" uniqKey="Esposito F" first="Floriana" last="Esposito">Floriana Esposito</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Malerba, Donato" sort="Malerba, Donato" uniqKey="Malerba D" first="Donato" last="Malerba">Donato Malerba</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Adaptive algorithm</term>
<term>Character recognition</term>
<term>Competitive intelligence</term>
<term>Decision tree</term>
<term>Document analysis</term>
<term>Document processing</term>
<term>Document retrieval</term>
<term>Extensible markup language</term>
<term>HTML language</term>
<term>Image interpretation</term>
<term>Image processing</term>
<term>Information browsing</term>
<term>Information conversion</term>
<term>Information retrieval</term>
<term>Internet</term>
<term>Layout problem</term>
<term>Plant layout</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Intelligence économique</term>
<term>Planning installation</term>
<term>Algorithme adaptatif</term>
<term>Reconnaissance caractère</term>
<term>Recherche information</term>
<term>Traitement document</term>
<term>Réseau web</term>
<term>Traitement image</term>
<term>Interprétation image</term>
<term>Recherche documentaire</term>
<term>Problème agencement</term>
<term>Conversion information</term>
<term>Navigation information</term>
<term>Internet</term>
<term>Analyse documentaire</term>
<term>Langage HTML</term>
<term>Arbre décision</term>
<term>XML</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Intelligence économique</term>
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and text transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Italie</li>
</country>
</list>
<tree>
<country name="Italie">
<noRegion>
<name sortKey="Altamura, Oronzo" sort="Altamura, Oronzo" uniqKey="Altamura O" first="Oronzo" last="Altamura">Oronzo Altamura</name>
</noRegion>
<name sortKey="Esposito, Floriana" sort="Esposito, Floriana" uniqKey="Esposito F" first="Floriana" last="Esposito">Floriana Esposito</name>
<name sortKey="Malerba, Donato" sort="Malerba, Donato" uniqKey="Malerba D" first="Donato" last="Malerba">Donato Malerba</name>
</country>
</tree>
</affiliations>
</record>
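
The record above uses the `wicri:` and `inist:` prefixes without declaring them, so a namespace-aware XML parser would normally reject it as it stands. As a minimal sketch, assuming the record is available as a string without an XML declaration, the Python below wraps it in a root element that binds those prefixes to placeholder URIs and then reads a few fields. The URIs, the `wrapper` element, the function names, and the abbreviated `SAMPLE` are illustrative assumptions, not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the record does not declare
# the wicri: and inist: prefixes, so we bind them ourselves.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(record_xml: str) -> ET.Element:
    """Wrap the record in a root that declares the missing prefixes, then parse.

    Assumes record_xml has no leading <?xml ...?> declaration.
    """
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    wrapped = f"<wrapper {decls}>{record_xml}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def summarize(record: ET.Element) -> dict:
    """Extract title, author names, year, and ISSN from a parsed record."""
    file_desc = record.find("./TEI/teiHeader/fileDesc")
    title_stmt = file_desc.find("titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [n.text for n in title_stmt.findall("author/name")],
        "year": file_desc.findtext("publicationStmt/date"),
        "issn": file_desc.findtext("seriesStmt/idno[@type='ISSN']"),
    }


# Abbreviated copy of the record above, kept short for illustration.
SAMPLE = """<record><TEI><teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Transforming paper documents into XML format with WISDOM++</title>
<author><name>Oronzo Altamura</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s2>70126 Bari</s2></inist:fA14></affiliation>
</author>
<author><name>Floriana Esposito</name></author>
<author><name>Donato Malerba</name></author>
</titleStmt>
<publicationStmt><date when="2001">2001</date></publicationStmt>
<seriesStmt><idno type="ISSN">1433-2833</idno></seriesStmt>
</fileDesc></teiHeader></TEI></record>"""

if __name__ == "__main__":
    print(summarize(parse_record(SAMPLE)))
```

The `xml:lang` attribute needs no declaration because the `xml` prefix is predefined. Wrapping the record, rather than editing it, leaves the original text untouched.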

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001C76 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001C76 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     Pascal:02-0009551
   |texte=   Transforming paper documents into XML format with WISDOM++
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024