OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Font adaptive word indexing of modern printed documents

Internal identifier: 001167 (Main/Curation); previous: 001166; next: 001168


Authors: Simone Marinai [Italy]; Emanuele Marino [Italy]; Giovanni Soda [Italy]

Source:

RBID : Pascal:06-0386778

French descriptors

English descriptors

Abstract

We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.
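
The abstract's point that word-level indexing stores word positions, and so supports proximity queries, can be illustrated with a minimal positional index. This is only a sketch over plain text tokens. The paper indexes word images without OCR, so this is not the authors' method. The function names `build_positional_index` and `proximity_query` and the two sample pages are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def proximity_query(index, first, second, max_distance):
    """Return sorted ids of documents where both words occur
    at most max_distance positions apart."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both words can match.
    for doc_id in first_postings.keys() & second_postings.keys():
        if any(abs(p - q) <= max_distance
               for p in first_postings[doc_id]
               for q in second_postings[doc_id]):
            hits.add(doc_id)
    return sorted(hits)


# Hypothetical sample data, standing in for transcribed pages.
documents = {
    "page1": "self organizing maps cluster character images",
    "page2": "maps of the city and images of the river",
}
index = build_positional_index(documents)
print(proximity_query(index, "character", "images", 1))
```

Storing positions rather than a simple word-to-document mapping is what makes the distance test possible. An index without positions could only report that both words occur somewhere in the same document.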


Links to Exploration step

Pascal:06-0386778

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Font adaptive word indexing of modern printed documents</title>
<author>
<name sortKey="Marinai, Simone" sort="Marinai, Simone" uniqKey="Marinai S" first="Simone" last="Marinai">Simone Marinai</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marino, Emanuele" sort="Marino, Emanuele" uniqKey="Marino E" first="Emanuele" last="Marino">Emanuele Marino</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Soda, Giovanni" sort="Soda, Giovanni" uniqKey="Soda G" first="Giovanni" last="Soda">Giovanni Soda</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">06-0386778</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 06-0386778 INIST</idno>
<idno type="RBID">Pascal:06-0386778</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000377</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000409</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000335</idno>
<idno type="wicri:doubleKey">0162-8828:2006:Marinai S:font:adaptive:word</idno>
<idno type="wicri:Area/Main/Merge">001196</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:16886856</idno>
<idno type="wicri:Area/PubMed/Corpus">000063</idno>
<idno type="wicri:Area/PubMed/Curation">000063</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000063</idno>
<idno type="wicri:Area/Ncbi/Merge">000030</idno>
<idno type="wicri:Area/Ncbi/Curation">000030</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000030</idno>
<idno type="wicri:doubleKey">0162-8828:2006:Marinai S:font:adaptive:word</idno>
<idno type="wicri:Area/Main/Merge">000F78</idno>
<idno type="wicri:Area/Main/Curation">001167</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Font adaptive word indexing of modern printed documents</title>
<author>
<name sortKey="Marinai, Simone" sort="Marinai, Simone" uniqKey="Marinai S" first="Simone" last="Marinai">Simone Marinai</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marino, Emanuele" sort="Marino, Emanuele" uniqKey="Marino E" first="Emanuele" last="Marino">Emanuele Marino</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Soda, Giovanni" sort="Soda, Giovanni" uniqKey="Soda G" first="Giovanni" last="Soda">Giovanni Soda</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">IEEE transactions on pattern analysis and machine intelligence</title>
<title level="j" type="abbreviated">IEEE trans. pattern anal. mach. intell.</title>
<idno type="ISSN">0162-8828</idno>
<imprint>
<date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">IEEE transactions on pattern analysis and machine intelligence</title>
<title level="j" type="abbreviated">IEEE trans. pattern anal. mach. intell.</title>
<idno type="ISSN">0162-8828</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Abstracting and Indexing as Topic (methods)</term>
<term>Algorithms</term>
<term>Artificial Intelligence</term>
<term>Automatic Data Processing (methods)</term>
<term>Character recognition</term>
<term>Computer Graphics</term>
<term>Database query</term>
<term>Document layout</term>
<term>Documentation (methods)</term>
<term>Electronic library</term>
<term>Image Enhancement (methods)</term>
<term>Image Interpretation, Computer-Assisted (methods)</term>
<term>Image processing</term>
<term>Image recognition</term>
<term>Information Storage and Retrieval (methods)</term>
<term>Information browsing</term>
<term>Information retrieval</term>
<term>Libraries, Digital</term>
<term>Metadata</term>
<term>Natural Language Processing</term>
<term>Pattern Recognition, Automated (methods)</term>
<term>Pattern analysis</term>
<term>Pattern recognition</term>
<term>Publishing</term>
<term>Reproducibility of Results</term>
<term>Self-organising feature maps</term>
<term>Semantics</term>
<term>Sensitivity and Specificity</term>
<term>Signal Processing, Computer-Assisted</term>
<term>Subtraction Technique</term>
<term>User-Computer Interface</term>
<term>Vocabulary, Controlled</term>
<term>World wide web</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Abstracting and Indexing as Topic</term>
<term>Automatic Data Processing</term>
<term>Documentation</term>
<term>Image Enhancement</term>
<term>Image Interpretation, Computer-Assisted</term>
<term>Information Storage and Retrieval</term>
<term>Pattern Recognition, Automated</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Artificial Intelligence</term>
<term>Computer Graphics</term>
<term>Libraries, Digital</term>
<term>Natural Language Processing</term>
<term>Publishing</term>
<term>Reproducibility of Results</term>
<term>Semantics</term>
<term>Sensitivity and Specificity</term>
<term>Signal Processing, Computer-Assisted</term>
<term>Subtraction Technique</term>
<term>User-Computer Interface</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance image</term>
<term>Analyse forme</term>
<term>Traitement image</term>
<term>Présentation document</term>
<term>Réseau web</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance caractère</term>
<term>Métadonnée</term>
<term>Navigation information</term>
<term>Bibliothèque électronique</term>
<term>Interrogation base donnée</term>
<term>Recherche information</term>
<term>Carte autoorganisatrice</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.</div>
</front>
</TEI>
<double idat="0162-8828:2006:Marinai S:font:adaptive:word">
<INIST>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Font adaptive word indexing of modern printed documents</title>
<author>
<name sortKey="Marinai, Simone" sort="Marinai, Simone" uniqKey="Marinai S" first="Simone" last="Marinai">Simone Marinai</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marino, Emanuele" sort="Marino, Emanuele" uniqKey="Marino E" first="Emanuele" last="Marino">Emanuele Marino</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Soda, Giovanni" sort="Soda, Giovanni" uniqKey="Soda G" first="Giovanni" last="Soda">Giovanni Soda</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">06-0386778</idno>
<date when="2006">2006</date>
<idno type="stanalyst">PASCAL 06-0386778 INIST</idno>
<idno type="RBID">Pascal:06-0386778</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000377</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000409</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000335</idno>
<idno type="wicri:doubleKey">0162-8828:2006:Marinai S:font:adaptive:word</idno>
<idno type="wicri:Area/Main/Merge">001196</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Font adaptive word indexing of modern printed documents</title>
<author>
<name sortKey="Marinai, Simone" sort="Marinai, Simone" uniqKey="Marinai S" first="Simone" last="Marinai">Simone Marinai</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marino, Emanuele" sort="Marino, Emanuele" uniqKey="Marino E" first="Emanuele" last="Marino">Emanuele Marino</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Soda, Giovanni" sort="Soda, Giovanni" uniqKey="Soda G" first="Giovanni" last="Soda">Giovanni Soda</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3</s1>
<s2>50139 Firenze</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">IEEE transactions on pattern analysis and machine intelligence</title>
<title level="j" type="abbreviated">IEEE trans. pattern anal. mach. intell.</title>
<idno type="ISSN">0162-8828</idno>
<imprint>
<date when="2006">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">IEEE transactions on pattern analysis and machine intelligence</title>
<title level="j" type="abbreviated">IEEE trans. pattern anal. mach. intell.</title>
<idno type="ISSN">0162-8828</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Database query</term>
<term>Document layout</term>
<term>Electronic library</term>
<term>Image processing</term>
<term>Image recognition</term>
<term>Information browsing</term>
<term>Information retrieval</term>
<term>Metadata</term>
<term>Pattern analysis</term>
<term>Pattern recognition</term>
<term>Self-organising feature maps</term>
<term>World wide web</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance image</term>
<term>Analyse forme</term>
<term>Traitement image</term>
<term>Présentation document</term>
<term>Réseau web</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance caractère</term>
<term>Métadonnée</term>
<term>Navigation information</term>
<term>Bibliothèque électronique</term>
<term>Interrogation base donnée</term>
<term>Recherche information</term>
<term>Carte autoorganisatrice</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.</div>
</front>
</TEI>
</INIST>
<PubMed>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Font adaptive word indexing of modern printed documents.</title>
<author>
<name sortKey="Marinai, Simone" sort="Marinai, Simone" uniqKey="Marinai S" first="Simone" last="Marinai">Simone Marinai</name>
<affiliation wicri:level="1">
<nlm:affiliation>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3, 50139 Firenze, Italy. marinai@dsi.unifi.it</nlm:affiliation>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3, 50139 Firenze</wicri:regionArea>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marino, Emanuele" sort="Marino, Emanuele" uniqKey="Marino E" first="Emanuele" last="Marino">Emanuele Marino</name>
</author>
<author>
<name sortKey="Soda, Giovanni" sort="Soda, Giovanni" uniqKey="Soda G" first="Giovanni" last="Soda">Giovanni Soda</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2006">2006</date>
<idno type="RBID">pubmed:16886856</idno>
<idno type="pmid">16886856</idno>
<idno type="doi">10.1109/TPAMI.2006.162</idno>
<idno type="wicri:Area/PubMed/Corpus">000063</idno>
<idno type="wicri:Area/PubMed/Curation">000063</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000063</idno>
<idno type="wicri:Area/Ncbi/Merge">000030</idno>
<idno type="wicri:Area/Ncbi/Curation">000030</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000030</idno>
<idno type="wicri:doubleKey">0162-8828:2006:Marinai S:font:adaptive:word</idno>
<idno type="wicri:Area/Main/Merge">000F78</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Font adaptive word indexing of modern printed documents.</title>
<author>
<name sortKey="Marinai, Simone" sort="Marinai, Simone" uniqKey="Marinai S" first="Simone" last="Marinai">Simone Marinai</name>
<affiliation wicri:level="1">
<nlm:affiliation>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3, 50139 Firenze, Italy. marinai@dsi.unifi.it</nlm:affiliation>
<country xml:lang="fr">Italie</country>
<wicri:regionArea>Dipartimento di Sistemi e Informatica, Università di Firenze, via di S. Marta, 3, 50139 Firenze</wicri:regionArea>
<wicri:noRegion>50139 Firenze</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Marino, Emanuele" sort="Marino, Emanuele" uniqKey="Marino E" first="Emanuele" last="Marino">Emanuele Marino</name>
</author>
<author>
<name sortKey="Soda, Giovanni" sort="Soda, Giovanni" uniqKey="Soda G" first="Giovanni" last="Soda">Giovanni Soda</name>
</author>
</analytic>
<series>
<title level="j">IEEE transactions on pattern analysis and machine intelligence</title>
<idno type="ISSN">0162-8828</idno>
<imprint>
<date when="2006" type="published">2006</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Abstracting and Indexing as Topic (methods)</term>
<term>Algorithms</term>
<term>Artificial Intelligence</term>
<term>Automatic Data Processing (methods)</term>
<term>Computer Graphics</term>
<term>Documentation (methods)</term>
<term>Image Enhancement (methods)</term>
<term>Image Interpretation, Computer-Assisted (methods)</term>
<term>Information Storage and Retrieval (methods)</term>
<term>Libraries, Digital</term>
<term>Natural Language Processing</term>
<term>Pattern Recognition, Automated (methods)</term>
<term>Publishing</term>
<term>Reproducibility of Results</term>
<term>Semantics</term>
<term>Sensitivity and Specificity</term>
<term>Signal Processing, Computer-Assisted</term>
<term>Subtraction Technique</term>
<term>User-Computer Interface</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Abstracting and Indexing as Topic</term>
<term>Automatic Data Processing</term>
<term>Documentation</term>
<term>Image Enhancement</term>
<term>Image Interpretation, Computer-Assisted</term>
<term>Information Storage and Retrieval</term>
<term>Pattern Recognition, Automated</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Artificial Intelligence</term>
<term>Computer Graphics</term>
<term>Libraries, Digital</term>
<term>Natural Language Processing</term>
<term>Publishing</term>
<term>Reproducibility of Results</term>
<term>Semantics</term>
<term>Sensitivity and Specificity</term>
<term>Signal Processing, Computer-Assisted</term>
<term>Subtraction Technique</term>
<term>User-Computer Interface</term>
<term>Vocabulary, Controlled</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We propose an approach for the word-level indexing of modern printed documents which are difficult to recognize using current OCR engines. By means of word-level indexing, it is possible to retrieve the position of words in a document, enabling queries involving proximity of terms. Web search engines implement this kind of indexing, allowing users to retrieve Web pages on the basis of their textual content. Nowadays, digital libraries hold collections of digitized documents that can be retrieved either by browsing the document images or relying on appropriate metadata assembled by domain experts. Word indexing tools would therefore increase the access to these collections. The proposed system is designed to index homogeneous document collections by automatically adapting to different languages and font styles without relying on OCR engines for character recognition. The approach is based on three main ideas: the use of Self Organizing Maps (SOM) to perform unsupervised character clustering, the definition of one suitable vector-based word representation whose size depends on the word aspect-ratio, and the run-time alignment of the query word with indexed words to deal with broken and touching characters. The most appropriate applications are for processing modern printed documents (17th to 19th centuries) where current OCR engines are less accurate. Our experimental analysis addresses six data sets containing documents ranging from books of the 17th century to contemporary journals.</div>
</front>
</TEI>
</PubMed>
</double>
</record>
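
For readers without Dilib, the title and author names can be read from the record with a generic XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the `<titleStmt>` element. The affiliations are left out because the listing does not show declarations for the `wicri:` and `inist:` prefixes, so a strict parser would reject them as they stand. The function name `read_title_and_authors` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <titleStmt> shown above; affiliations are omitted
# because their wicri:/inist: prefixes are not declared in the listing.
FRAGMENT = """
<titleStmt>
<title xml:lang="en" level="a">Font adaptive word indexing of modern printed documents</title>
<author>
<name sortKey="Marinai, Simone" first="Simone" last="Marinai">Simone Marinai</name>
</author>
<author>
<name sortKey="Marino, Emanuele" first="Emanuele" last="Marino">Emanuele Marino</name>
</author>
<author>
<name sortKey="Soda, Giovanni" first="Giovanni" last="Soda">Giovanni Soda</name>
</author>
</titleStmt>
"""


def read_title_and_authors(xml_text):
    """Return (title, [author sort keys]) from a titleStmt fragment."""
    root = ET.fromstring(xml_text.strip())
    title = root.findtext("title")
    # sortKey holds the "Last, First" form used for ordering in the record.
    authors = [name.get("sortKey") for name in root.iter("name")]
    return title, authors


title, authors = read_title_and_authors(FRAGMENT)
print(title)
print(authors)
```

To parse the full record, the prefixes would first have to be bound to namespace URIs, which this page does not provide.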

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001167 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Curation/biblio.hfd -nk 001167 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Curation
   |type=    RBID
   |clé=     Pascal:06-0386778
   |texte=   Font adaptive word indexing of modern printed documents
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024