
Structure recognition and information extraction from tabular documents

Internal identifier: 000335 (Istex/Corpus); previous: 000334; next: 000336

Authors: Surekha Chandran; Sanjay Balasubramanian; Tarak Gandhi; Arathi Prasad; Rangachar Kasturi; Atul Chhabra

Source:

International Journal of Imaging Systems and Technology [ 0899-9457 ] ; 1996-12.

RBID : ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC

Abstract

We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.
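The white-stream idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the image has already been binarized and deskewed, represents it as a list of rows with 1 for ink and 0 for background, and treats any run of at least `min_width` fully blank rows (or columns) as a stream. The function names and the `min_width` parameter are hypothetical.

```python
from typing import List, Tuple


def white_streams(image: List[List[int]], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) row ranges of horizontal white streams.

    A white stream is taken here to be a run of at least `min_width`
    consecutive rows containing no ink pixels (an illustrative assumption).
    """
    streams = []
    start = None  # index of the first row of the current blank run, if any
    for y, row in enumerate(image):
        if not any(row):
            # Row is entirely background: open a run if none is open.
            if start is None:
                start = y
        else:
            # Row contains ink: close the current run and keep it if wide enough.
            if start is not None and y - start >= min_width:
                streams.append((start, y - 1))
            start = None
    # A run may extend to the bottom edge of the image.
    if start is not None and len(image) - start >= min_width:
        streams.append((start, len(image) - 1))
    return streams


def vertical_white_streams(image: List[List[int]], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column ranges of vertical white streams."""
    # Transposing turns columns into rows, so the same scan can be reused.
    transposed = [list(col) for col in zip(*image)]
    return white_streams(transposed, min_width)
```

In a table without ruling lines, the ranges returned by these two functions could stand in for the missing horizontal and vertical demarcation lines. A practical system would also have to tolerate noise pixels and residual skew, which this sketch ignores.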

Url:

https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/pdf

DOI: 10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4

Links to Exploration step

ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Structure recognition and information extraction from tabular documents</title>
<author><name sortKey="Chandran, Surekha" sort="Chandran, Surekha" uniqKey="Chandran S" first="Surekha" last="Chandran">Surekha Chandran</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Balasubramanian, Sanjay" sort="Balasubramanian, Sanjay" uniqKey="Balasubramanian S" first="Sanjay" last="Balasubramanian">Sanjay Balasubramanian</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Gandhi, Tarak" sort="Gandhi, Tarak" uniqKey="Gandhi T" first="Tarak" last="Gandhi">Tarak Gandhi</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Prasad, Arathi" sort="Prasad, Arathi" uniqKey="Prasad A" first="Arathi" last="Prasad">Arathi Prasad</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Kasturi, Rangachar" sort="Kasturi, Rangachar" uniqKey="Kasturi R" first="Rangachar" last="Kasturi">Rangachar Kasturi</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Chhabra, Atul" sort="Chhabra, Atul" uniqKey="Chhabra A" first="Atul" last="Chhabra">Atul Chhabra</name>
<affiliation><mods:affiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC</idno>
<date when="1996" year="1996">1996</date>
<idno type="doi">10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4</idno>
<idno type="url">https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000335</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Structure recognition and information extraction from tabular documents</title>
<author><name sortKey="Chandran, Surekha" sort="Chandran, Surekha" uniqKey="Chandran S" first="Surekha" last="Chandran">Surekha Chandran</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Balasubramanian, Sanjay" sort="Balasubramanian, Sanjay" uniqKey="Balasubramanian S" first="Sanjay" last="Balasubramanian">Sanjay Balasubramanian</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Gandhi, Tarak" sort="Gandhi, Tarak" uniqKey="Gandhi T" first="Tarak" last="Gandhi">Tarak Gandhi</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Prasad, Arathi" sort="Prasad, Arathi" uniqKey="Prasad A" first="Arathi" last="Prasad">Arathi Prasad</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Kasturi, Rangachar" sort="Kasturi, Rangachar" uniqKey="Kasturi R" first="Rangachar" last="Kasturi">Rangachar Kasturi</name>
<affiliation><mods:affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Chhabra, Atul" sort="Chhabra, Atul" uniqKey="Chhabra A" first="Atul" last="Chhabra">Atul Chhabra</name>
<affiliation><mods:affiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">International Journal of Imaging Systems and Technology</title>
<title level="j" type="abbrev">Int. J. Imaging Syst. Technol.</title>
<idno type="ISSN">0899-9457</idno>
<idno type="eISSN">1098-1098</idno>
<imprint><publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>Hoboken</pubPlace>
<date type="published" when="1996-12">1996-12</date>
<biblScope unit="volume">7</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="289">289</biblScope>
<biblScope unit="page" to="303">303</biblScope>
</imprint>
<idno type="ISSN">0899-9457</idno>
</series>
<idno type="istex">12A55967131C87E335F57E8D56C5013369EB88BC</idno>
<idno type="DOI">10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4</idno>
<idno type="ArticleID">IMA4</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0899-9457</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.</div>
</front>
</TEI>
<istex><corpusName>wiley</corpusName>
<author><json:item><name>Surekha Chandran</name>
<affiliations><json:string>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</json:string>
</affiliations>
</json:item>
<json:item><name>Sanjay Balasubramanian</name>
<affiliations><json:string>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</json:string>
</affiliations>
</json:item>
<json:item><name>Tarak Gandhi</name>
<affiliations><json:string>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</json:string>
</affiliations>
</json:item>
<json:item><name>Arathi Prasad</name>
<affiliations><json:string>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</json:string>
</affiliations>
</json:item>
<json:item><name>Rangachar Kasturi</name>
<affiliations><json:string>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</json:string>
</affiliations>
</json:item>
<json:item><name>Atul Chhabra</name>
<affiliations><json:string>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</json:string>
</affiliations>
</json:item>
</author>
<articleId><json:string>IMA4</json:string>
</articleId>
<language><json:string>eng</json:string>
</language>
<abstract>We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.</abstract>
<qualityIndicators><score>8</score>
<pdfVersion>1.3</pdfVersion>
<pdfPageSize>612 x 792 pts (letter)</pdfPageSize>
<refBibsNative>true</refBibsNative>
<keywordCount>0</keywordCount>
<abstractCharCount>1883</abstractCharCount>
<pdfWordCount>8364</pdfWordCount>
<pdfCharCount>48046</pdfCharCount>
<pdfPageCount>15</pdfPageCount>
<abstractWordCount>287</abstractWordCount>
</qualityIndicators>
<title>Structure recognition and information extraction from tabular documents</title>
<genre.original><json:string>article</json:string>
</genre.original>
<genre><json:string>article</json:string>
</genre>
<host><volume>7</volume>
<publisherId><json:string>IMA</json:string>
</publisherId>
<pages><total>15</total>
<last>303</last>
<first>289</first>
</pages>
<issn><json:string>0899-9457</json:string>
</issn>
<issue>4</issue>
<genre><json:string>journal</json:string>
</genre>
<language><json:string>unknown</json:string>
</language>
<eissn><json:string>1098-1098</json:string>
</eissn>
<title>International Journal of Imaging Systems and Technology</title>
<doi><json:string>10.1002/(ISSN)1098-1098</json:string>
</doi>
</host>
<publicationDate>1996</publicationDate>
<copyrightDate>1996</copyrightDate>
<doi><json:string>10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4</json:string>
</doi>
<id>12A55967131C87E335F57E8D56C5013369EB88BC</id>
<fulltext><json:item><original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/pdf</uri>
</json:item>
<json:item><original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/tei"><teiHeader><fileDesc><titleStmt><title level="a" type="main" xml:lang="en">Structure recognition and information extraction from tabular documents</title>
</titleStmt>
<publicationStmt><authority>ISTEX</authority>
<publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>Hoboken</pubPlace>
<availability><p>WILEY</p>
</availability>
<date>1996</date>
</publicationStmt>
<sourceDesc><biblStruct type="inbook"><analytic><title level="a" type="main" xml:lang="en">Structure recognition and information extraction from tabular documents</title>
<author><persName><forename type="first">Surekha</forename>
<surname>Chandran</surname>
</persName>
<note type="correspondence"><p>Correspondence: Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</p>
</note>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
</author>
<author><persName><forename type="first">Sanjay</forename>
<surname>Balasubramanian</surname>
</persName>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
</author>
<author><persName><forename type="first">Tarak</forename>
<surname>Gandhi</surname>
</persName>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
</author>
<author><persName><forename type="first">Arathi</forename>
<surname>Prasad</surname>
</persName>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
</author>
<author><persName><forename type="first">Rangachar</forename>
<surname>Kasturi</surname>
</persName>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
</author>
<author><persName><forename type="first">Atul</forename>
<surname>Chhabra</surname>
</persName>
<affiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</affiliation>
</author>
</analytic>
<monogr><title level="j">International Journal of Imaging Systems and Technology</title>
<title level="j" type="abbrev">Int. J. Imaging Syst. Technol.</title>
<idno type="pISSN">0899-9457</idno>
<idno type="eISSN">1098-1098</idno>
<idno type="DOI">10.1002/(ISSN)1098-1098</idno>
<imprint><publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>Hoboken</pubPlace>
<date type="published" when="1996-12"></date>
<biblScope unit="volume">7</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="289">289</biblScope>
<biblScope unit="page" to="303">303</biblScope>
</imprint>
</monogr>
<idno type="istex">12A55967131C87E335F57E8D56C5013369EB88BC</idno>
<idno type="DOI">10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4</idno>
<idno type="ArticleID">IMA4</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><creation><date>1996</date>
</creation>
<langUsage><language ident="en">en</language>
</langUsage>
<abstract xml:lang="en"><p>We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.</p>
</abstract>
</profileDesc>
<revisionDesc><change when="1996-12">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item><original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/12A55967131C87E335F57E8D56C5013369EB88BC/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata><istex:metadataXml wicri:clean="Wiley, elements deleted: body"><istex:xmlDeclaration>version="1.0" encoding="UTF-8" standalone="yes"</istex:xmlDeclaration>
<istex:document><component version="2.0" type="serialArticle" xml:lang="en"><header><publicationMeta level="product"><publisherInfo><publisherName>Wiley Subscription Services, Inc., A Wiley Company</publisherName>
<publisherLoc>Hoboken</publisherLoc>
</publisherInfo>
<doi registered="yes">10.1002/(ISSN)1098-1098</doi>
<issn type="print">0899-9457</issn>
<issn type="electronic">1098-1098</issn>
<idGroup><id type="product" value="IMA"></id>
</idGroup>
<titleGroup><title type="main" xml:lang="en" sort="INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY">International Journal of Imaging Systems and Technology</title>
<title type="short">Int. J. Imaging Syst. Technol.</title>
</titleGroup>
</publicationMeta>
<publicationMeta level="part" position="40"><doi origin="wiley" registered="yes">10.1002/(SICI)1098-1098(199624)7:4<>1.0.CO;2-4</doi>
<numberingGroup><numbering type="journalVolume" number="7">7</numbering>
<numbering type="journalIssue">4</numbering>
</numberingGroup>
<coverDate startDate="1996-12">Winter 1996</coverDate>
</publicationMeta>
<publicationMeta level="unit" type="article" position="50" status="forIssue"><doi origin="wiley" registered="yes">10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4</doi>
<idGroup><id type="unit" value="IMA4"></id>
</idGroup>
<countGroup><count type="pageTotal" number="15"></count>
</countGroup>
<copyright ownership="publisher">Copyright © 1996 John Wiley & Sons, Inc.</copyright>
<eventGroup><event type="manuscriptRevised" date="1996-05-29"></event>
<event type="firstOnline" date="1998-12-07"></event>
<event type="publishedOnlineFinalForm" date="1998-12-07"></event>
<event type="xmlConverted" agent="Converter:JWSART34_TO_WML3G version:2.3.2 mode:FullText source:HeaderRef result:HeaderRef" date="2010-03-04"></event>
<event type="xmlConverted" agent="Converter:WILEY_ML3G_TO_WILEY_ML3GV2 version:3.8.8" date="2014-01-28"></event>
<event type="xmlConverted" agent="Converter:WML3G_To_WML3G version:4.1.7 mode:FullText,remove_FC" date="2014-10-23"></event>
</eventGroup>
<numberingGroup><numbering type="pageFirst">289</numbering>
<numbering type="pageLast">303</numbering>
</numberingGroup>
<correspondenceTo>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</correspondenceTo>
<linkGroup><link type="toTypesetVersion" href="file:IMA.IMA4.pdf"></link>
</linkGroup>
</publicationMeta>
<contentMeta><countGroup><count type="figureTotal" number="26"></count>
<count type="tableTotal" number="0"></count>
<count type="referenceTotal" number="11"></count>
</countGroup>
<titleGroup><title type="main" xml:lang="en">Structure recognition and information extraction from tabular documents</title>
</titleGroup>
<creators><creator xml:id="au1" creatorRole="author" affiliationRef="#af1" corresponding="yes"><personName><givenNames>Surekha</givenNames>
<familyName>Chandran</familyName>
</personName>
</creator>
<creator xml:id="au2" creatorRole="author" affiliationRef="#af1"><personName><givenNames>Sanjay</givenNames>
<familyName>Balasubramanian</familyName>
</personName>
</creator>
<creator xml:id="au3" creatorRole="author" affiliationRef="#af1"><personName><givenNames>Tarak</givenNames>
<familyName>Gandhi</familyName>
</personName>
</creator>
<creator xml:id="au4" creatorRole="author" affiliationRef="#af1"><personName><givenNames>Arathi</givenNames>
<familyName>Prasad</familyName>
</personName>
</creator>
<creator xml:id="au5" creatorRole="author" affiliationRef="#af1"><personName><givenNames>Rangachar</givenNames>
<familyName>Kasturi</familyName>
</personName>
</creator>
<creator xml:id="au6" creatorRole="author" affiliationRef="#af2"><personName><givenNames>Atul</givenNames>
<familyName>Chhabra</familyName>
</personName>
</creator>
</creators>
<affiliationGroup><affiliation xml:id="af1" countryCode="US" type="organization"><unparsedAffiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</unparsedAffiliation>
</affiliation>
<affiliation xml:id="af2" countryCode="US" type="organization"><unparsedAffiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</unparsedAffiliation>
</affiliation>
</affiliationGroup>
<abstractGroup><abstract type="main" xml:lang="en"><title type="main">Abstract</title>
<p>We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.</p>
</abstract>
</abstractGroup>
</contentMeta>
</header>
</component>
</istex:document>
</istex:metadataXml>
<mods version="3.6"><titleInfo lang="en"><title>Structure recognition and information extraction from tabular documents</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA" lang="en"><title>Structure recognition and information extraction from tabular documents</title>
</titleInfo>
<name type="personal"><namePart type="given">Surekha</namePart>
<namePart type="family">Chandran</namePart>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
<description>Correspondence: Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</description>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Sanjay</namePart>
<namePart type="family">Balasubramanian</namePart>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Tarak</namePart>
<namePart type="family">Gandhi</namePart>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Arathi</namePart>
<namePart type="family">Prasad</namePart>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Rangachar</namePart>
<namePart type="family">Kasturi</namePart>
<affiliation>Department of Computer Science and Engineering, Pond Laboratory, The Pennsylvania State University, University Park, PA 16802</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Atul</namePart>
<namePart type="family">Chhabra</namePart>
<affiliation>NYNEX Science and Technology, Inc., 500 Westchester Avenue, White Plains, NY 10604</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="article" displayLabel="article"></genre>
<originInfo><publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<place><placeTerm type="text">Hoboken</placeTerm>
</place>
<dateIssued encoding="w3cdtf">1996-12</dateIssued>
<copyrightDate encoding="w3cdtf">1996</copyrightDate>
</originInfo>
<language><languageTerm type="code" authority="rfc3066">en</languageTerm>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
<physicalDescription><internetMediaType>text/html</internetMediaType>
<extent unit="figures">26</extent>
<extent unit="references">11</extent>
</physicalDescription>
<abstract lang="en">We present a system for the extraction of the structural information of a table from its image. Following the initial binarization and deskewing operations, the image is scanned to extract all horizontal and vertical lines that may be present. The table's dimensions are estimated based on these lines. Unlike other systems, the procedure described here does not depend on the sole existence of lines to mark the item blocks. White streams are recognized in both the horizontal and vertical directions as substitutes for any missing demarcation lines. A structure interpretation procedure uses the extracted demarcation information to identify each of the item blocks in the table. Subsequently, the interrelations of these item blocks are used to recognize the structure of the tabulated data. The interpretation can be done for one‐dimensional as well as two‐dimensional tables. Interpretation of the tabular document involves character recognition, which in turn depends on the structure of the table. The above procedure to extract the structural information of the tabular document can be used to extract useful information from different types of tabular drawings. In this article, we focus our attention on interpreting telephone company central office drawings. These drawings contain additional information in the form of crossed‐out entries and repeated entries, which must be detected and recognized to interpret the document completely. Hence, after extracting the basic structure of the drawing, the additional information is extracted and cell block location is obtained in order to develop a data base representing the tabular document. The telephone company drawings are very large in size, resulting in images as large as 15,000 x 10,000 pixels. Thus, designing efficient and fast algorithms is an important criterion in this research. © 1996 John Wiley & Sons, Inc.</abstract>
<relatedItem type="host"><titleInfo><title>International Journal of Imaging Systems and Technology</title>
</titleInfo>
<titleInfo type="abbreviated"><title>Int. J. Imaging Syst. Technol.</title>
</titleInfo>
<genre type="journal">journal</genre>
<identifier type="ISSN">0899-9457</identifier>
<identifier type="eISSN">1098-1098</identifier>
<identifier type="DOI">10.1002/(ISSN)1098-1098</identifier>
<identifier type="PublisherID">IMA</identifier>
<part><date>1996</date>
<detail type="volume"><caption>vol.</caption>
<number>7</number>
</detail>
<detail type="issue"><caption>no.</caption>
<number>4</number>
</detail>
<extent unit="pages"><start>289</start>
<end>303</end>
<total>15</total>
</extent>
</part>
</relatedItem>
<identifier type="istex">12A55967131C87E335F57E8D56C5013369EB88BC</identifier>
<identifier type="DOI">10.1002/(SICI)1098-1098(199624)7:4<289::AID-IMA4>3.0.CO;2-4</identifier>
<identifier type="ArticleID">IMA4</identifier>
<accessCondition type="use and reproduction" contentType="copyright">Copyright © 1996 John Wiley & Sons, Inc.</accessCondition>
<recordInfo><recordContentSource>WILEY</recordContentSource>
<recordOrigin>Wiley Subscription Services, Inc., A Wiley Company</recordOrigin>
</recordInfo>
</mods>
</metadata>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000335 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000335 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:12A55967131C87E335F57E8D56C5013369EB88BC
   |texte=   Structure recognition and information extraction from tabular documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
