Content-Based Document Image Retrieval in Complex Document Collections

Internal identifier: 000F55 (Main/Merge); previous: 000F54; next: 000F56

Authors: G. Agam [United States]; S. Argamon [United States]; O. Frieder [United States]; D. Grossman [United States]; D. Lewis [United States]

Source:

Proceedings of Electronic Imaging Science and Technology

RBID : Pascal:08-0459099

French descriptors

Pascal (Inist)
- Traitement image, Recherche par contenu, Recherche image, Traitement image document, Impression, Caractère manuscrit, Sécurité, Indexation, Reconnaissance optique caractère, Hors ligne, 0130C, 4230V, 4230S.

English descriptors

KwdEn :
- Content-based retrieval, Document image processing, Image processing, Image retrieval, Indexing, Manuscript character, Off line, Optical character recognition, Printing, Safety.

Abstract

We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting), and include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype, which we are developing, for integrated retrieval and aggregation of diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000255
to stream PascalFrancis, to step Curation: 000529
to stream PascalFrancis, to step Checkpoint: 000284

Links to Exploration step

Pascal:08-0459099

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Content-Based Document Image Retrieval in Complex Document Collections</title>
<author><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>David D. Lewis Consulting, 858 W. Armitage Ave., #296</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">08-0459099</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 08-0459099 INIST</idno>
<idno type="RBID">Pascal:08-0459099</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000255</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000529</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000284</idno>
<idno type="wicri:Area/Main/Merge">000F55</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Content-Based Document Image Retrieval in Complex Document Collections</title>
<author><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>David D. Lewis Consulting, 858 W. Armitage Ave., #296</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of Electronic Imaging Science and Technology</title>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of Electronic Imaging Science and Technology</title>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Content-based retrieval</term>
<term>Document image processing</term>
<term>Image processing</term>
<term>Image retrieval</term>
<term>Indexing</term>
<term>Manuscript character</term>
<term>Off line</term>
<term>Optical character recognition</term>
<term>Printing</term>
<term>Safety</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Traitement image</term>
<term>Recherche par contenu</term>
<term>Recherche image</term>
<term>Traitement image document</term>
<term>Impression</term>
<term>Caractère manuscrit</term>
<term>Sécurité</term>
<term>Indexation</term>
<term>Reconnaissance optique caractère</term>
<term>Hors ligne</term>
<term>0130C</term>
<term>4230V</term>
<term>4230S</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based OCR data and ignore the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype for integrated retrieval and aggregation of diverse information contained in scanned paper documents we are developing. Such complex document information processing combines several forms of image processing together with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Illinois</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Illinois"><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
</region>
<name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F55 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000F55 | SxmlIndent | more
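
Outside Dilib, a record like the one shown above could also be read with a general-purpose XML parser. The following is a minimal sketch in Python, not part of Dilib. It assumes the record text is already available as a string, for example captured from the output of the commands above. The record uses the prefixes `wicri:` and `inist:` without declaring them, so the sketch injects placeholder namespace URIs before parsing. Those URIs, the function names, and the shortened sample record are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the record does not declare
# the wicri: and inist: prefixes, and a strict parser rejects unbound prefixes.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(xml_text):
    """Parse a <record> string after binding the undeclared prefixes."""
    decls = " ".join(
        'xmlns:{}="{}"'.format(prefix, uri)
        for prefix, uri in PLACEHOLDER_NS.items()
    )
    # Only the first <record> start tag is patched.
    patched = xml_text.replace("<record>", "<record {}>".format(decls), 1)
    return ET.fromstring(patched)


def title_and_authors(record):
    """Return the title and the author display names from titleStmt."""
    title_stmt = record.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = title_stmt.findtext("title")
    authors = [author.findtext("name") for author in title_stmt.findall("author")]
    return title, authors


# Shortened sample (one author only), modeled on the record above.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Content-Based Document Image Retrieval in Complex Document Collections</title>
<author><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
</inist:fA14>
<country>États-Unis</country>
</affiliation>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""

title, authors = title_and_authors(parse_record(SAMPLE))
print(title)
print(authors)
```

Binding the prefixes to placeholder URIs keeps the rest of the record untouched, so the same element paths apply to the full record with all five authors.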

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     Pascal:08-0459099
   |texte=   Content-Based Document Image Retrieval in Complex Document Collections
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development! Warning: this site was generated by automated means from raw corpora. The information is therefore not validated.
