Content-Based Document Image Retrieval in Complex Document Collections
Internal identifier: 000F55 (Main/Merge); previous: 000F54; next: 000F56
Authors: G. Agam [United States]; S. Argamon [United States]; O. Frieder [United States]; D. Grossman [United States]; D. Lewis [United States]
Source:
- Proceedings of Electronic Imaging Science and Technology
French descriptors
- Pascal (Inist)
English descriptors
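- KwdEn: Content-based retrieval, Document image processing, Image processing, Image retrieval, Indexing, Manuscript character, Off line, Optical character recognition, Printing, Safety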
Abstract
We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype that we are developing for integrated retrieval and aggregation of diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000255
- to stream PascalFrancis, to step Curation: 000529
- to stream PascalFrancis, to step Checkpoint: 000284
Links to Exploration step
Pascal:08-0459099
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Content-Based Document Image Retrieval in Complex Document Collections</title>
<author><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>David D. Lewis Consulting, 858 W. Armitage Ave., #296</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">08-0459099</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 08-0459099 INIST</idno>
<idno type="RBID">Pascal:08-0459099</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000255</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000529</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000284</idno>
<idno type="wicri:Area/Main/Merge">000F55</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Content-Based Document Image Retrieval in Complex Document Collections</title>
<author><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Department of Computer Science, Illinois Institute of Technology</s1>
<s2>Chicago, IL 60616</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
<affiliation wicri:level="2"><inist:fA14 i1="02"><s1>David D. Lewis Consulting, 858 W. Armitage Ave., #296</s1>
<s2>Chicago, IL 60614</s2>
<s3>USA</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Illinois</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of Electronic Imaging Science and Technology</title>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of Electronic Imaging Science and Technology</title>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Content-based retrieval</term>
<term>Document image processing</term>
<term>Image processing</term>
<term>Image retrieval</term>
<term>Indexing</term>
<term>Manuscript character</term>
<term>Off line</term>
<term>Optical character recognition</term>
<term>Printing</term>
<term>Safety</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Traitement image</term>
<term>Recherche par contenu</term>
<term>Recherche image</term>
<term>Traitement image document</term>
<term>Impression</term>
<term>Caractère manuscrit</term>
<term>Sécurité</term>
<term>Indexation</term>
<term>Reconnaissance optique caractère</term>
<term>Hors ligne</term>
<term>0130C</term>
<term>4230V</term>
<term>4230S</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We address the problem of content-based image retrieval in the context of complex document images. Complex documents typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting) and may include diagrams, graphics, tables, and other non-textual elements. Large collections of such complex documents are commonly found in legal and security investigations. The indexing and analysis of large document collections is currently limited to textual features based on OCR data and ignores the structural context of the document as well as important non-textual elements such as signatures, logos, stamps, tables, diagrams, and images. Handwritten comments are also normally ignored due to the inherent complexity of offline handwriting recognition. We address important research issues concerning content-based document image retrieval and describe a prototype that we are developing for integrated retrieval and aggregation of diverse information contained in scanned paper documents. Such complex document information processing combines several forms of image processing with textual/linguistic processing to enable effective analysis of complex document collections, a necessity for a wide range of applications. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are developing a test collection containing millions of document images.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Illinois</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Illinois"><name sortKey="Agam, G" sort="Agam, G" uniqKey="Agam G" first="G." last="Agam">G. Agam</name>
</region>
<name sortKey="Argamon, S" sort="Argamon, S" uniqKey="Argamon S" first="S." last="Argamon">S. Argamon</name>
<name sortKey="Frieder, O" sort="Frieder, O" uniqKey="Frieder O" first="O." last="Frieder">O. Frieder</name>
<name sortKey="Grossman, D" sort="Grossman, D" uniqKey="Grossman D" first="D." last="Grossman">D. Grossman</name>
<name sortKey="Lewis, D" sort="Lewis, D" uniqKey="Lewis D" first="D." last="Lewis">D. Lewis</name>
</country>
</tree>
</affiliations>
</record>
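The record above can also be read programmatically. The following is a minimal Python sketch, not part of Dilib, that extracts the title and author names from a record of this shape using the standard library. It assumes that the `wicri:` and `inist:` prefixes are left undeclared, as in the listing above, so it binds them to placeholder namespace URIs (`urn:example:wicri` and `urn:example:inist`, both invented for illustration) before parsing. The helper names `parse_record` and `summarize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the record uses the wicri: and
# inist: prefixes without declaring them, so a strict XML parser would
# otherwise reject it with an "unbound prefix" error.
PLACEHOLDER_NS = (
    ' xmlns:wicri="urn:example:wicri"'
    ' xmlns:inist="urn:example:inist"'
)


def parse_record(xml_text):
    """Parse a record after binding the undeclared prefixes on the root."""
    patched = xml_text.replace("<record>", "<record" + PLACEHOLDER_NS + ">", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Return the title and the author names listed in the titleStmt."""
    title_stmt = record.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = title_stmt.findtext("title")
    authors = [name.text for name in title_stmt.findall("author/name")]
    return title, authors


# Abbreviated sample with the same shape as the record above.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">Content-Based Document Image Retrieval in Complex Document Collections</title>
<author><name sortKey="Agam, G" first="G." last="Agam">G. Agam</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s3>USA</s3></inist:fA14>
<country>États-Unis</country></affiliation></author>
<author><name sortKey="Lewis, D" first="D." last="Lewis">D. Lewis</name></author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""

title, authors = summarize(parse_record(SAMPLE))
print(title)
print(authors)
```

Only the prefixed names need a namespace binding. The unprefixed elements (`record`, `TEI`, `titleStmt`, and so on) stay in no namespace, so plain path expressions are enough to reach them.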
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F55 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000F55 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= Pascal:08-0459099 |texte= Content-Based Document Image Retrieval in Complex Document Collections }}
This area was generated with Dilib version V0.6.32.