OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The detection of duplicates in document image databases

Internal identifier: 000186 ( Istex/Corpus ); previous: 000185; next: 000187

Authors: David Doermann; Huiping Li; Omid Kia

RBID : ISTEX:14D436D3D8870783B3BB81632855ECBEC0DF7FAD

Abstract

Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with an index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, however, especially in a distributed environment where copies of the same documents may be scanned at different times. In this paper we present a novel approach to detecting duplicate documents in very large databases using only features extracted from the image. The method is based on a robust 'signature' extracted from each document image, which is used to index into a table of previously processed documents. The system is able to deal with differences between scanned document instances such as resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods, including speed and robustness to imaging distortions. To justify the approach and demonstrate its scalability, we have developed a simulator which allows us to change parameters of the system and examine performance while processing millions of document signatures. A complete system has been implemented and tested on a collection of technical articles and memos.
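
The abstract describes a signature that is used to index into a table of previously processed documents. The sketch below illustrates one way such a lookup could work. It is not the authors' implementation. It assumes that a signature is a sequence of small integer shape codes, that "a" in the table captions of the record is the number of distinct codes, and that "w" is the number of consecutive codes combined into one index key. The function names are hypothetical.

```python
from collections import defaultdict

ALPHABET_SIZE = 8   # assumed meaning of "a": number of distinct shape codes
WINDOW = 5          # assumed meaning of "w": shape codes per index key


def index_keys(signature, window=WINDOW, alphabet_size=ALPHABET_SIZE):
    """Turn each run of `window` consecutive shape codes into a bucket number."""
    keys = []
    for start in range(len(signature) - window + 1):
        key = 0
        for code in signature[start:start + window]:
            if not 0 <= code < alphabet_size:
                raise ValueError(f"shape code {code} is outside the alphabet")
            key = key * alphabet_size + code  # base-`alphabet_size` encoding
        keys.append(key)
    return keys


def add_document(table, doc_id, signature):
    """Register a document under every distinct index key of its signature."""
    for key in set(index_keys(signature)):
        table[key].add(doc_id)


def candidates(table, signature):
    """Rank stored documents by the number of index keys shared with the query."""
    votes = defaultdict(int)
    for key in set(index_keys(signature)):
        for doc_id in table.get(key, ()):
            votes[doc_id] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


table = defaultdict(set)
add_document(table, "A", [0, 1, 2, 3, 4, 5, 6, 7])
add_document(table, "B", [7, 6, 5, 4, 3, 2, 1, 0])
# The query differs from document "A" only in its last shape code.
ranking = candidates(table, [0, 1, 2, 3, 4, 5, 6, 0])
```

Under these assumptions there are 8**5 = 32768 possible keys, which is consistent with the "32k buckets" mentioned in the caption of Table 1. Voting over several keys lets a query still reach the right document when a few shape codes differ between two scans.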

DOI: 10.1016/S0262-8856(98)00054-7

Links to Exploration step

ISTEX:14D436D3D8870783B3BB81632855ECBEC0DF7FAD

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>The detection of duplicates in document image databases</title>
<author>
<name sortKey="Doermann, David" sort="Doermann, David" uniqKey="Doermann D" first="David" last="Doermann">David Doermann</name>
<affiliation>
<mods:affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Li, Huiping" sort="Li, Huiping" uniqKey="Li H" first="Huiping" last="Li">Huiping Li</name>
<affiliation>
<mods:affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Kia, Omid" sort="Kia, Omid" uniqKey="Kia O" first="Omid" last="Kia">Omid Kia</name>
<affiliation>
<mods:affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:14D436D3D8870783B3BB81632855ECBEC0DF7FAD</idno>
<date when="1998" year="1998">1998</date>
<idno type="doi">10.1016/S0262-8856(98)00054-7</idno>
<idno type="url">https://api.istex.fr/document/14D436D3D8870783B3BB81632855ECBEC0DF7FAD/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000186</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">The detection of duplicates in document image databases</title>
<author>
<name sortKey="Doermann, David" sort="Doermann, David" uniqKey="Doermann D" first="David" last="Doermann">David Doermann</name>
<affiliation>
<mods:affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Li, Huiping" sort="Li, Huiping" uniqKey="Li H" first="Huiping" last="Li">Huiping Li</name>
<affiliation>
<mods:affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Kia, Omid" sort="Kia, Omid" uniqKey="Kia O" first="Omid" last="Kia">Omid Kia</name>
<affiliation>
<mods:affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Image and Vision Computing</title>
<title level="j" type="abbrev">IMAVIS</title>
<idno type="ISSN">0262-8856</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="1997">1997</date>
<biblScope unit="volume">16</biblScope>
<biblScope unit="issue">12–13</biblScope>
<biblScope unit="page" from="907">907</biblScope>
<biblScope unit="page" to="920">920</biblScope>
</imprint>
<idno type="ISSN">0262-8856</idno>
</series>
<idno type="istex">14D436D3D8870783B3BB81632855ECBEC0DF7FAD</idno>
<idno type="DOI">10.1016/S0262-8856(98)00054-7</idno>
<idno type="PII">S0262-8856(98)00054-7</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0262-8856</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with an index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, however, especially in a distributed environment where copies of the same documents may be scanned at different times. In this paper we present a novel approach to detecting duplicate documents in very large databases using only features extracted from the image. The method is based on a robust `signature' extracted from each document image which is used to index into a table of previously processed documents. The system is able to deal with differences between scanned document instances such resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods including speed and robustness to imaging distortions. To justify the approach and demonstrate its scalability, we have developed a simulator which allows us to change parameters of the system and examine performance while processing millions of document signatures. A complete system has been implemented and tested on a collection of technical articles and memos.</div>
</front>
</TEI>
<istex>
<corpusName>elsevier</corpusName>
<author>
<json:item>
<name>David Doermann</name>
<affiliations>
<json:string>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Huiping Li</name>
<affiliations>
<json:string>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</json:string>
</affiliations>
</json:item>
<json:item>
<name>Omid Kia</name>
<affiliations>
<json:string>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</json:string>
</affiliations>
</json:item>
</author>
<subject>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Document image databases</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Duplicate detection</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Shape coding</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Document indexing</value>
</json:item>
</subject>
<language>
<json:string>eng</json:string>
</language>
<abstract>Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with an index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, however, especially in a distributed environment where copies of the same documents may be scanned at different times. In this paper we present a novel approach to detecting duplicate documents in very large databases using only features extracted from the image. The method is based on a robust `signature' extracted from each document image which is used to index into a table of previously processed documents. The system is able to deal with differences between scanned document instances such resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods including speed and robustness to imaging distortions. To justify the approach and demonstrate its scalability, we have developed a simulator which allows us to change parameters of the system and examine performance while processing millions of document signatures. A complete system has been implemented and tested on a collection of technical articles and memos.</abstract>
<qualityIndicators>
<score>7.616</score>
<pdfVersion>1.2</pdfVersion>
<pdfPageSize>596 x 795 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<keywordCount>4</keywordCount>
<abstractCharCount>1407</abstractCharCount>
<pdfWordCount>6823</pdfWordCount>
<pdfCharCount>39822</pdfCharCount>
<pdfPageCount>14</pdfPageCount>
<abstractWordCount>218</abstractWordCount>
</qualityIndicators>
<title>The detection of duplicates in document image databases</title>
<pii>
<json:string>S0262-8856(98)00054-7</json:string>
</pii>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<volume>16</volume>
<pii>
<json:string>S0262-8856(00)X0041-8</json:string>
</pii>
<pages>
<last>920</last>
<first>907</first>
</pages>
<issn>
<json:string>0262-8856</json:string>
</issn>
<issue>12–13</issue>
<genre>
<json:string>Journal</json:string>
</genre>
<language>
<json:string>unknown</json:string>
</language>
<title>Image and Vision Computing</title>
<publicationDate>1998</publicationDate>
</host>
<categories>
<wos>
<json:string>COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE</json:string>
<json:string>COMPUTER SCIENCE, SOFTWARE ENGINEERING</json:string>
<json:string>COMPUTER SCIENCE, THEORY &amp; METHODS</json:string>
<json:string>ENGINEERING, ELECTRICAL &amp; ELECTRONIC</json:string>
<json:string>OPTICS</json:string>
</wos>
</categories>
<publicationDate>1997</publicationDate>
<copyrightDate>1998</copyrightDate>
<doi>
<json:string>10.1016/S0262-8856(98)00054-7</json:string>
</doi>
<id>14D436D3D8870783B3BB81632855ECBEC0DF7FAD</id>
<fulltext>
<json:item>
<original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/14D436D3D8870783B3BB81632855ECBEC0DF7FAD/fulltext/pdf</uri>
</json:item>
<json:item>
<original>true</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/14D436D3D8870783B3BB81632855ECBEC0DF7FAD/fulltext/txt</uri>
</json:item>
<json:item>
<original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/14D436D3D8870783B3BB81632855ECBEC0DF7FAD/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/14D436D3D8870783B3BB81632855ECBEC0DF7FAD/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a">The detection of duplicates in document image databases</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher>ELSEVIER</publisher>
<availability>
<p>ELSEVIER</p>
</availability>
<date>1998</date>
</publicationStmt>
<notesStmt>
<note type="content">Fig. 1: Sample character shape code assignment.</note>
<note type="content">Fig. 2: Overview of indexing scheme.</note>
<note type="content">Fig. 3: Simulator overview.</note>
<note type="content">Fig. 4: Shape codes from sample signature 831 and its index keys.</note>
<note type="content">Fig. 5: Percentage of duplicate documents identified in the top n candidates documents retrieved.</note>
<note type="content">Fig. 6: The representative line chosen for a document with mixed text and graphics.</note>
<note type="content">Fig. 7: Shape coding results for part of line shown in Fig. 6.</note>
<note type="content">Fig. 8: Example xheight hypothesis, xline and baseline.</note>
<note type="content">Fig. 9: Example results: query document in the upper left and top ranked documents (left to right, top to bottom).</note>
<note type="content">Fig. 10: User interface.</note>
<note type="content">Table 1: For m=50, w=5, a=8, the index table has 32k buckets</note>
<note type="content">Table 2: Index table characteristics for w=5, m=50, N=50M, K=2Mb</note>
<note type="content">Table 3: Index table characteristics for a=8, m=50, N=50M, K=2Mb</note>
<note type="content">Table 4: Index table characteristics for a=8, w=5, N=50M, K=2Mb</note>
<note type="content">Table 5: Table of shape codes and symbols to which they apply</note>
<note type="content">Table 6: Scores of line 831 in [candidate (score)] format</note>
<note type="content">Table 7: Scores of line 5416, which was not in our database, in [candidate (score)] format</note>
<note type="content">Table 8: Top duplicate candidates in 100 queries</note>
<note type="content">Table 9: Number of duplicate candidates detected in 2500 queries from a pool of one million documents</note>
<note type="content">Table 10: Retrieval results: querying 307 duplicate document images against a database of 1347 documents</note>
<note type="content">Table 11: Average similarity: querying 307 duplicate document images against a database of 1347 documents</note>
</notesStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a">The detection of duplicates in document image databases</title>
<author>
<persName>
<forename type="first">David</forename>
<surname>Doermann</surname>
</persName>
<note type="correspondence">
<p>Corresponding author. Tel: +1 301 4051767; fax: +1 301 3149115.</p>
</note>
<affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</affiliation>
</author>
<author>
<persName>
<forename type="first">Huiping</forename>
<surname>Li</surname>
</persName>
<affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</affiliation>
</author>
<author>
<persName>
<forename type="first">Omid</forename>
<surname>Kia</surname>
</persName>
<affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</affiliation>
</author>
</analytic>
<monogr>
<title level="j">Image and Vision Computing</title>
<title level="j" type="abbrev">IMAVIS</title>
<idno type="pISSN">0262-8856</idno>
<idno type="PII">S0262-8856(00)X0041-8</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="1997"></date>
<biblScope unit="volume">16</biblScope>
<biblScope unit="issue">12–13</biblScope>
<biblScope unit="page" from="907">907</biblScope>
<biblScope unit="page" to="920">920</biblScope>
</imprint>
</monogr>
<idno type="istex">14D436D3D8870783B3BB81632855ECBEC0DF7FAD</idno>
<idno type="DOI">10.1016/S0262-8856(98)00054-7</idno>
<idno type="PII">S0262-8856(98)00054-7</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>1998</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with an index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, however, especially in a distributed environment where copies of the same documents may be scanned at different times. In this paper we present a novel approach to detecting duplicate documents in very large databases using only features extracted from the image. The method is based on a robust `signature' extracted from each document image which is used to index into a table of previously processed documents. The system is able to deal with differences between scanned document instances such resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods including speed and robustness to imaging distortions. To justify the approach and demonstrate its scalability, we have developed a simulator which allows us to change parameters of the system and examine performance while processing millions of document signatures. A complete system has been implemented and tested on a collection of technical articles and memos.</p>
</abstract>
<textClass>
<keywords scheme="keyword">
<list>
<head>Keywords</head>
<item>
<term>Document image databases</term>
</item>
<item>
<term>Duplicate detection</term>
</item>
<item>
<term>Shape coding</term>
</item>
<item>
<term>Document indexing</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="1997-11-12">Registration</change>
<change when="1997-11-05">Modified</change>
<change when="1997">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="Elsevier, elements deleted: ce:floats; body; tail">
<istex:xmlDeclaration>version="1.0" encoding="utf-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//ES//DTD journal article DTD version 4.5.2//EN//XML" URI="art452.dtd" name="istex:docType">
<istex:entity SYSTEM="gr1" NDATA="IMAGE" name="gr1"></istex:entity>
<istex:entity SYSTEM="gr2" NDATA="IMAGE" name="gr2"></istex:entity>
<istex:entity SYSTEM="gr3" NDATA="IMAGE" name="gr3"></istex:entity>
<istex:entity SYSTEM="gr4" NDATA="IMAGE" name="gr4"></istex:entity>
<istex:entity SYSTEM="gr5" NDATA="IMAGE" name="gr5"></istex:entity>
<istex:entity SYSTEM="gr6" NDATA="IMAGE" name="gr6"></istex:entity>
<istex:entity SYSTEM="gr8" NDATA="IMAGE" name="gr8"></istex:entity>
<istex:entity SYSTEM="gr7" NDATA="IMAGE" name="gr7"></istex:entity>
<istex:entity SYSTEM="gr9" NDATA="IMAGE" name="gr9"></istex:entity>
<istex:entity SYSTEM="gr10" NDATA="IMAGE" name="gr10"></istex:entity>
</istex:docType>
<istex:document>
<converted-article version="4.5.2" docsubtype="fla">
<item-info>
<jid>IMAVIS</jid>
<aid>1510</aid>
<ce:pii>S0262-8856(98)00054-7</ce:pii>
<ce:doi>10.1016/S0262-8856(98)00054-7</ce:doi>
<ce:copyright year="1998" type="full-transfer">Elsevier Science B.V.</ce:copyright>
</item-info>
<head>
<ce:title>The detection of duplicates in document image databases</ce:title>
<ce:author-group>
<ce:author>
<ce:given-name>David</ce:given-name>
<ce:surname>Doermann</ce:surname>
<ce:cross-ref refid="CORR1">*</ce:cross-ref>
</ce:author>
<ce:author>
<ce:given-name>Huiping</ce:given-name>
<ce:surname>Li</ce:surname>
</ce:author>
<ce:author>
<ce:given-name>Omid</ce:given-name>
<ce:surname>Kia</ce:surname>
</ce:author>
<ce:affiliation>
<ce:textfn>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</ce:textfn>
</ce:affiliation>
<ce:correspondence id="CORR1">
<ce:label>*</ce:label>
<ce:text>Corresponding author. Tel: +1 301 4051767; fax: +1 301 3149115.</ce:text>
</ce:correspondence>
</ce:author-group>
<ce:date-received day="21" month="3" year="1997"></ce:date-received>
<ce:date-revised day="5" month="11" year="1997"></ce:date-revised>
<ce:date-accepted day="12" month="11" year="1997"></ce:date-accepted>
<ce:abstract>
<ce:section-title>Abstract</ce:section-title>
<ce:abstract-sec>
<ce:simple-para>Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with an index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, however, especially in a distributed environment where copies of the same documents may be scanned at different times.</ce:simple-para>
<ce:simple-para>In this paper we present a novel approach to detecting duplicate documents in very large databases using only features extracted from the image. The method is based on a robust `signature' extracted from each document image which is used to index into a table of previously processed documents. The system is able to deal with differences between scanned document instances such resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods including speed and robustness to imaging distortions.</ce:simple-para>
<ce:simple-para>To justify the approach and demonstrate its scalability, we have developed a simulator which allows us to change parameters of the system and examine performance while processing millions of document signatures. A complete system has been implemented and tested on a collection of technical articles and memos.</ce:simple-para>
</ce:abstract-sec>
</ce:abstract>
<ce:keywords class="keyword">
<ce:section-title>Keywords</ce:section-title>
<ce:keyword>
<ce:text>Document image databases</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Duplicate detection</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Shape coding</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Document indexing</ce:text>
</ce:keyword>
</ce:keywords>
</head>
</converted-article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo>
<title>The detection of duplicates in document image databases</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA">
<title>The detection of duplicates in document image databases</title>
</titleInfo>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Doermann</namePart>
<affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</affiliation>
<description>Corresponding author. Tel: +1 301 4051767; fax: +1 301 3149115.</description>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Huiping</namePart>
<namePart type="family">Li</namePart>
<affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Omid</namePart>
<namePart type="family">Kia</namePart>
<affiliation>Language and Media Processing Laboratory, Center for Automation Research, University of Maryland, College Park, MD 20742-3275, USA</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="Full-length article"></genre>
<originInfo>
<publisher>ELSEVIER</publisher>
<dateIssued encoding="w3cdtf">1997</dateIssued>
<dateValid encoding="w3cdtf">1997-11-12</dateValid>
<dateModified encoding="w3cdtf">1997-11-05</dateModified>
<copyrightDate encoding="w3cdtf">1998</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<physicalDescription>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract lang="en">Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with an index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, however, especially in a distributed environment where copies of the same documents may be scanned at different times. In this paper we present a novel approach to detecting duplicate documents in very large databases using only features extracted from the image. The method is based on a robust `signature' extracted from each document image which is used to index into a table of previously processed documents. The system is able to deal with differences between scanned document instances such resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods including speed and robustness to imaging distortions. To justify the approach and demonstrate its scalability, we have developed a simulator which allows us to change parameters of the system and examine performance while processing millions of document signatures. A complete system has been implemented and tested on a collection of technical articles and memos.</abstract>
<note type="content">Fig. 1: Sample character shape code assignment.</note>
<note type="content">Fig. 2: Overview of indexing scheme.</note>
<note type="content">Fig. 3: Simulator overview.</note>
<note type="content">Fig. 4: Shape codes from sample signature 831 and its index keys.</note>
<note type="content">Fig. 5: Percentage of duplicate documents identified in the top n candidates documents retrieved.</note>
<note type="content">Fig. 6: The representative line chosen for a document with mixed text and graphics.</note>
<note type="content">Fig. 7: Shape coding results for part of line shown in Fig. 6.</note>
<note type="content">Fig. 8: Example xheight hypothesis, xline and baseline.</note>
<note type="content">Fig. 9: Example results: query document in the upper left and top ranked documents (left to right, top to bottom).</note>
<note type="content">Fig. 10: User interface.</note>
<note type="content">Table 1: For m=50, w=5, a=8, the index table has 32k buckets</note>
<note type="content">Table 2: Index table characteristics for w=5, m=50, N=50M, K=2Mb</note>
<note type="content">Table 3: Index table characteristics for a=8, m=50, N=50M, K=2Mb</note>
<note type="content">Table 4: Index table characteristics for a=8, w=5, N=50M, K=2Mb</note>
<note type="content">Table 5: Table of shape codes and symbols to which they apply</note>
<note type="content">Table 6: Scores of line 831 in [candidate (score)] format</note>
<note type="content">Table 7: Scores of line 5416, which was not in our database, in [candidate (score)] format</note>
<note type="content">Table 8: Top duplicate candidates in 100 queries</note>
<note type="content">Table 9: Number of duplicate candidates detected in 2500 queries from a pool of one million documents</note>
<note type="content">Table 10: Retrieval results: querying 307 duplicate document images against a database of 1347 documents</note>
<note type="content">Table 11: Average similarity: querying 307 duplicate document images against a database of 1347 documents</note>
<subject>
<genre>Keywords</genre>
<topic>Document image databases</topic>
<topic>Duplicate detection</topic>
<topic>Shape coding</topic>
<topic>Document indexing</topic>
</subject>
<relatedItem type="host">
<titleInfo>
<title>Image and Vision Computing</title>
</titleInfo>
<titleInfo type="abbreviated">
<title>IMAVIS</title>
</titleInfo>
<genre type="Journal">journal</genre>
<originInfo>
<dateIssued encoding="w3cdtf">19980824</dateIssued>
</originInfo>
<identifier type="ISSN">0262-8856</identifier>
<identifier type="PII">S0262-8856(00)X0041-8</identifier>
<part>
<date>19980824</date>
<detail type="volume">
<number>16</number>
<caption>vol.</caption>
</detail>
<detail type="issue">
<number>12–13</number>
<caption>no.</caption>
</detail>
<extent unit="issue pages">
<start>817</start>
<end>960</end>
</extent>
<extent unit="pages">
<start>907</start>
<end>920</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">14D436D3D8870783B3BB81632855ECBEC0DF7FAD</identifier>
<identifier type="DOI">10.1016/S0262-8856(98)00054-7</identifier>
<identifier type="PII">S0262-8856(98)00054-7</identifier>
<accessCondition type="use and reproduction" contentType="">© 1998 Elsevier Science B.V.</accessCondition>
<recordInfo>
<recordContentSource>ELSEVIER</recordContentSource>
<recordOrigin>Elsevier Science B.V., ©1998</recordOrigin>
</recordInfo>
</mods>
</metadata>
<enrichments>
<istex:catWosTEI uri="https://api.istex.fr/document/14D436D3D8870783B3BB81632855ECBEC0DF7FAD/enrichments/catWos">
<teiHeader>
<profileDesc>
<textClass>
<classCode scheme="WOS">COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE</classCode>
<classCode scheme="WOS">COMPUTER SCIENCE, SOFTWARE ENGINEERING</classCode>
<classCode scheme="WOS">COMPUTER SCIENCE, THEORY &amp; METHODS</classCode>
<classCode scheme="WOS">ENGINEERING, ELECTRICAL &amp; ELECTRONIC</classCode>
<classCode scheme="WOS">OPTICS</classCode>
</textClass>
</profileDesc>
</teiHeader>
</istex:catWosTEI>
</enrichments>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000186 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000186 | SxmlIndent | more
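
The record printed by the commands above can also be post-processed in a script. The sketch below pulls the title, DOI and author names out of the record text. Because the record as shown uses prefixes such as `mods:` and `json:` without namespace declarations, a strict XML parser may reject it, so this example uses regular expressions instead. The function names and the shortened `SAMPLE` record are illustrative assumptions, not part of Dilib.

```python
import re

# Shortened stand-in for the record printed by HfdSelect (illustrative only).
SAMPLE = """<record>
<title>The detection of duplicates in document image databases</title>
<name sortKey="Doermann, David" first="David" last="Doermann">David Doermann</name>
<name sortKey="Li, Huiping" first="Huiping" last="Li">Huiping Li</name>
<name sortKey="Doermann, David" first="David" last="Doermann">David Doermann</name>
<idno type="doi">10.1016/S0262-8856(98)00054-7</idno>
</record>"""


def first_match(pattern, text):
    """Return the first captured group of `pattern` in `text`, or None."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def summarize_record(record_xml):
    """Extract a few bibliographic fields from the record text."""
    names = re.findall(r'<name sortKey="[^"]*"[^>]*>(.*?)</name>', record_xml)
    return {
        "title": first_match(r"<title>(.*?)</title>", record_xml),
        "doi": first_match(r'<idno type="doi">(.*?)</idno>', record_xml),
        # The record repeats each author, so keep the first occurrence only.
        "authors": list(dict.fromkeys(names)),
    }


summary = summarize_record(SAMPLE)
```

In practice the output of `HfdSelect` could be piped into such a script and read from standard input. A pattern-based approach is fragile, so it only suits quick inspection of records with this exact layout.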

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:14D436D3D8870783B3BB81632855ECBEC0DF7FAD
   |texte=   The detection of duplicates in document image databases
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024