OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Structure recognition and information extraction from tabular documents

Internal identifier: 000330 (Istex/Curation); previous: 000329; next: 000331

Authors: Surekha Chandran [United States]; Sanjay Balasubramanian [United States]; Tarak Gandhi [United States]; Arathi Prasad [United States]; Rangachar Kasturi [United States]; Atul Chhabra [United States]

Source:

RBID : ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC

Abstract

We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.
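The white-stream idea mentioned in the abstract can be illustrated with a simple projection profile. What follows is only a minimal sketch and not the authors' algorithm: it assumes a small binary raster given as nested lists (1 = ink, 0 = background), and the function name `white_streams` and the `min_width` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def white_streams(
    image: List[List[int]], axis: str = "vertical", min_width: int = 2
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of ink-free columns or rows.

    image is a binary raster: 1 = ink, 0 = background (an assumption of
    this sketch). axis="vertical" looks for runs of empty columns, any
    other value looks for runs of empty rows.
    """
    if not image:
        return []
    if axis == "vertical":
        # Ink count per column: a column belongs to a stream when it is 0.
        profile = [sum(row[c] for row in image) for c in range(len(image[0]))]
    else:
        # Ink count per row.
        profile = [sum(row) for row in image]

    streams: List[Tuple[int, int]] = []
    start = None
    for i, ink in enumerate(profile):
        if ink == 0:
            if start is None:
                start = i  # a new empty run begins here
        elif start is not None:
            # The run ended at i - 1; keep it only if it is wide enough.
            if i - start >= min_width:
                streams.append((start, i - 1))
            start = None
    # Close a run that reaches the image border.
    if start is not None and len(profile) - start >= min_width:
        streams.append((start, len(profile) - 1))
    return streams
```

In a table without ruling lines, runs such as these could stand in for the missing column or row separators; on drawings of the size quoted in the abstract, a real implementation would need a far more efficient image representation than nested lists.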

Url:
DOI: 10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Structure recognition and information extraction from tabular documents</title>
<author>
<name sortKey="Chandran, Surekha" sort="Chandran, Surekha" uniqKey="Chandran S" first="Surekha" last="Chandran">Surekha Chandran</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Balasubramanian, Sanjay" sort="Balasubramanian, Sanjay" uniqKey="Balasubramanian S" first="Sanjay" last="Balasubramanian">Sanjay Balasubramanian</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Gandhi, Tarak" sort="Gandhi, Tarak" uniqKey="Gandhi T" first="Tarak" last="Gandhi">Tarak Gandhi</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Prasad, Arathi" sort="Prasad, Arathi" uniqKey="Prasad A" first="Arathi" last="Prasad">Arathi Prasad</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Kasturi, Rangachar" sort="Kasturi, Rangachar" uniqKey="Kasturi R" first="Rangachar" last="Kasturi">Rangachar Kasturi</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Chhabra, Atul" sort="Chhabra, Atul" uniqKey="Chhabra A" first="Atul" last="Chhabra">Atul Chhabra</name>
<affiliation wicri:level="2">
<mods:affiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">État de New York</region>
</placeName>
<wicri:cityArea>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains</wicri:cityArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC</idno>
<date when="1996" year="1996">1996</date>
<idno type="doi">10.1002/(SICI)1098-1098(199624)7:4&lt;289::AID-IMA4&gt;3.0.CO;2-4</idno>
<idno type="url">https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000335</idno>
<idno type="wicri:Area/Istex/Curation">000330</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Structure recognition and information extraction from tabular documents</title>
<author>
<name sortKey="Chandran, Surekha" sort="Chandran, Surekha" uniqKey="Chandran S" first="Surekha" last="Chandran">Surekha Chandran</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Balasubramanian, Sanjay" sort="Balasubramanian, Sanjay" uniqKey="Balasubramanian S" first="Sanjay" last="Balasubramanian">Sanjay Balasubramanian</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Gandhi, Tarak" sort="Gandhi, Tarak" uniqKey="Gandhi T" first="Tarak" last="Gandhi">Tarak Gandhi</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Prasad, Arathi" sort="Prasad, Arathi" uniqKey="Prasad A" first="Arathi" last="Prasad">Arathi Prasad</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Kasturi, Rangachar" sort="Kasturi, Rangachar" uniqKey="Kasturi R" first="Rangachar" last="Kasturi">Rangachar Kasturi</name>
<affiliation wicri:level="2">
<mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
<wicri:cityArea>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park</wicri:cityArea>
</affiliation>
</author>
<author>
<name sortKey="Chhabra, Atul" sort="Chhabra, Atul" uniqKey="Chhabra A" first="Atul" last="Chhabra">Atul Chhabra</name>
<affiliation wicri:level="2">
<mods:affiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</mods:affiliation>
<country xml:lang="fr">États-Unis</country>
<placeName>
<region type="state">État de New York</region>
</placeName>
<wicri:cityArea>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains</wicri:cityArea>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">International Journal of Imaging Systems and Technology</title>
<title level="j" type="abbrev">Int. J. Imaging Syst. Technol.</title>
<idno type="ISSN">0899-9457</idno>
<idno type="eISSN">1098-1098</idno>
<imprint>
<publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>Hoboken</pubPlace>
<date type="published" when="1996-12">1996-12</date>
<biblScope unit="volume">7</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="289">289</biblScope>
<biblScope unit="page" to="303">303</biblScope>
</imprint>
<idno type="ISSN">0899-9457</idno>
</series>
<idno type="istex">12A55967131C87E335F57E8D56C5013369EB88BC</idno>
<idno type="DOI">10.1002/(SICI)1098-1098(199624)7:4&lt;289::AID-IMA4&gt;3.0.CO;2-4</idno>
<idno type="ArticleID">IMA4</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0899-9457</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley &amp; Sons, Inc.</div>
</front>
</TEI>
</record>
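As an alternative to the Dilib tools, a record of this shape can be read with a general-purpose XML parser. The following is a minimal sketch under stated assumptions, not part of the Wicri tooling: it assumes the record is available as a string, that special characters in text content (such as the angle brackets in the DOI and the ampersand in the copyright line) are entity-escaped, and that the fragment carries no namespace declarations for the `wicri:` and `mods:` prefixes. The function name `list_authors` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple


def list_authors(record_xml: str) -> List[Tuple[str, str]]:
    """Return unique (last, first) pairs from <name> elements, in document order."""
    # Assumption: the prefixes are undeclared, which a namespace-aware parser
    # rejects, so "wicri:" and "mods:" are rewritten with an underscore.
    # This crude rewrite also alters attribute values such as "wicri:source",
    # which does not matter for reading author names.
    cleaned = re.sub(r"\b(wicri|mods):", r"\1_", record_xml)
    root = ET.fromstring(cleaned)

    authors: List[Tuple[str, str]] = []
    for name in root.iter("name"):
        pair = (name.get("last"), name.get("first"))
        # Each author appears in both titleStmt and biblStruct; keep one copy.
        if pair not in authors:
            authors.append(pair)
    return authors
```

Applied to the record above, with the entities escaped as assumed, this would yield each of the six authors once even though every author element occurs twice.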

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000330 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Curation/biblio.hfd -nk 000330 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Istex
   |étape=   Curation
   |type=    RBID
   |clé=     ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC
   |texte=   Structure recognition and information extraction from tabular documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024