Transforming paper documents into XML format with WISDOM++
Internal identifier: 001C76 (Main/Merge); previous: 001C75; next: 001C77
Authors: Oronzo Altamura [Italy]; Floriana Esposito [Italy]; Donato Malerba [Italy]
Source:
- International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2001.
French descriptors
- Pascal (Inist)
- Intelligence économique, Planning installation, Algorithme adaptatif, Reconnaissance caractère, Recherche information, Traitement document, Réseau web, Traitement image, Interprétation image, Recherche documentaire, Problème agencement, Conversion information, Navigation information, Internet, Analyse documentaire, Langage HTML, Arbre décision, XML.
- Wicri :
English descriptors
- KwdEn :
- Adaptive algorithm, Character recognition, Competitive intelligence, Decision tree, Document analysis, Document processing, Document retrieval, Extensible markup language, HTML language, Image interpretation, Image processing, Information browsing, Information conversion, Information retrieval, Internet, Layout problem, Plant layout, World wide web.
Abstract
The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and text transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
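The five-step order named in the abstract can be pictured as a simple sequential pipeline. The following Python sketch is only an illustration of that ordering: the `Page` class, `make_stage`, and `run_pipeline` are hypothetical names, and each stage is a stub that records its own name rather than an implementation of what WISDOM++ actually does.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Stage names taken from the abstract; everything else here is hypothetical.
STAGE_NAMES: Tuple[str, ...] = (
    "document analysis",
    "document classification",
    "document understanding",
    "text recognition (OCR)",
    "transformation into HTML/XML",
)


@dataclass
class Page:
    """Placeholder for a scanned page travelling through the pipeline."""
    image_path: str
    trace: List[str] = field(default_factory=list)  # stages applied so far


def make_stage(name: str) -> Callable[[Page], Page]:
    """Build a stub stage that only records that it ran."""
    def stage(page: Page) -> Page:
        page.trace.append(name)
        return page
    return stage


STAGES = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(page: Page) -> Page:
    """Apply the five stages in the order given in the abstract."""
    for stage in STAGES:
        page = stage(page)
    return page


if __name__ == "__main__":
    result = run_pipeline(Page("scan.tif"))
    print(" -> ".join(result.trace))
```

Each stage takes the output of the previous one, which reflects the abstract's point that OCR and HTML/XML generation depend on the layout and logical structure extracted earlier.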
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000698
- to stream PascalFrancis, to step Curation: 000094
- to stream PascalFrancis, to step Checkpoint: 000649
Links to Exploration step
Pascal:02-0009551
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Transforming paper documents into XML format with WISDOM++</title>
<author><name sortKey="Altamura, Oronzo" sort="Altamura, Oronzo" uniqKey="Altamura O" first="Oronzo" last="Altamura">Oronzo Altamura</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Esposito, Floriana" sort="Esposito, Floriana" uniqKey="Esposito F" first="Floriana" last="Esposito">Floriana Esposito</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Malerba, Donato" sort="Malerba, Donato" uniqKey="Malerba D" first="Donato" last="Malerba">Donato Malerba</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">02-0009551</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 02-0009551 INIST</idno>
<idno type="RBID">Pascal:02-0009551</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000698</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000094</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000649</idno>
<idno type="wicri:doubleKey">1433-2833:2001:Altamura O:transforming:paper:documents</idno>
<idno type="wicri:Area/Main/Merge">001C76</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Transforming paper documents into XML format with WISDOM++</title>
<author><name sortKey="Altamura, Oronzo" sort="Altamura, Oronzo" uniqKey="Altamura O" first="Oronzo" last="Altamura">Oronzo Altamura</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Esposito, Floriana" sort="Esposito, Floriana" uniqKey="Esposito F" first="Floriana" last="Esposito">Floriana Esposito</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Malerba, Donato" sort="Malerba, Donato" uniqKey="Malerba D" first="Donato" last="Malerba">Donato Malerba</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Dipartimento di Informatica, University degli Studi di Bari, via Orabona 4</s1>
<s2>70126 Bari</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>70126 Bari</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Adaptive algorithm</term>
<term>Character recognition</term>
<term>Competitive intelligence</term>
<term>Decision tree</term>
<term>Document analysis</term>
<term>Document processing</term>
<term>Document retrieval</term>
<term>Extensible markup language</term>
<term>HTML language</term>
<term>Image interpretation</term>
<term>Image processing</term>
<term>Information browsing</term>
<term>Information conversion</term>
<term>Information retrieval</term>
<term>Internet</term>
<term>Layout problem</term>
<term>Plant layout</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Intelligence économique</term>
<term>Planning installation</term>
<term>Algorithme adaptatif</term>
<term>Reconnaissance caractère</term>
<term>Recherche information</term>
<term>Traitement document</term>
<term>Réseau web</term>
<term>Traitement image</term>
<term>Interprétation image</term>
<term>Recherche documentaire</term>
<term>Problème agencement</term>
<term>Conversion information</term>
<term>Navigation information</term>
<term>Internet</term>
<term>Analyse documentaire</term>
<term>Langage HTML</term>
<term>Arbre décision</term>
<term>XML</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Intelligence économique</term>
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and text transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.</div>
</front>
</TEI>
<affiliations><list><country><li>Italie</li>
</country>
</list>
<tree><country name="Italie"><noRegion><name sortKey="Altamura, Oronzo" sort="Altamura, Oronzo" uniqKey="Altamura O" first="Oronzo" last="Altamura">Oronzo Altamura</name>
</noRegion>
<name sortKey="Esposito, Floriana" sort="Esposito, Floriana" uniqKey="Esposito F" first="Floriana" last="Esposito">Floriana Esposito</name>
<name sortKey="Malerba, Donato" sort="Malerba, Donato" uniqKey="Malerba D" first="Donato" last="Malerba">Donato Malerba</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001C76 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001C76 | SxmlIndent | more
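Outside Dilib, a record saved from the commands above can also be read with a general-purpose XML parser. The Python sketch below is a minimal example under stated assumptions: the record uses the `wicri:` and `inist:` prefixes without declaring them, so the code wraps it in a hypothetical `<wrapper>` element that binds both prefixes to placeholder URIs (these URIs are invented for parsing only and are not official namespaces). It also assumes the record text carries no XML declaration. The function names `parse_record` and `summarize` are illustrative.

```python
import xml.etree.ElementTree as ET

# Placeholder bindings: the wicri: and inist: prefixes are undeclared in the
# record, so these URIs are assumptions that exist only to make it parseable.
WRAPPER_OPEN = (
    '<wrapper xmlns:wicri="urn:example:wicri" '
    'xmlns:inist="urn:example:inist">'
)
WRAPPER_CLOSE = "</wrapper>"


def parse_record(record_xml: str) -> ET.Element:
    """Parse a <record> fragment after binding its undeclared prefixes."""
    return ET.fromstring(WRAPPER_OPEN + record_xml + WRAPPER_CLOSE)


def summarize(record_xml: str) -> dict:
    """Return the article title and author names from the <analytic> part."""
    root = parse_record(record_xml)
    analytic = root.find(".//analytic")
    title = analytic.findtext("title")
    authors = [name.text for name in analytic.findall("author/name")]
    return {"title": title, "authors": authors}


# A shortened sample with the same shape as the record on this page.
SAMPLE = (
    "<record><TEI><teiHeader><fileDesc><sourceDesc><biblStruct><analytic>"
    '<title xml:lang="en" level="a">Transforming paper documents into XML '
    "format with WISDOM++</title>"
    '<author><name sortKey="Altamura, Oronzo">Oronzo Altamura</name>'
    '<affiliation wicri:level="1"><country>Italie</country></affiliation>'
    "</author>"
    "</analytic></biblStruct></sourceDesc></fileDesc></teiHeader></TEI></record>"
)

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Without the wrapper, `xml.etree.ElementTree` rejects the record because of the unbound prefixes, which is why the placeholder bindings are needed.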
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= Pascal:02-0009551 |texte= Transforming paper documents into XML format with WISDOM++ }}
This area was generated with Dilib version V0.6.32.