OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books

Internal identifier: 000307 (PascalFrancis/Corpus); previous: 000306; next: 000308

Authors: Shaolei Feng; R. Manmatha

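Source: ACM/IEEE Joint Conference on Digital Libraries, 6th, Chapel Hill NC, USA, 2006; New York NY: ACM Press; pp. 109-118; ISBN 1-59593-354-9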

RBID : Francis:08-0091673

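French descriptors: Traitement automatique; Evaluation performance; Reconnaissance optique caractère; Etude utilisation; Bibliothèque électronique; Résultat

English descriptors: Automatic processing; Performance evaluation; Optical character recognition; Use study; Electronic library; Result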

Abstract

A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full-text search on the results. Many of these books are old, and a variety of processing steps are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, and hence a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works very well even if the OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.
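
The divide-and-conquer idea in the abstract (split one long alignment into many short ones) can be illustrated with a minimal Python sketch. It is not the authors' HMM-based algorithm: as a stand-in, it assumes that words occurring exactly once in both texts can serve as anchors, keeps the longest order-preserving chain of anchors, and aligns the short segments between anchors with the standard library's `difflib.SequenceMatcher`. The function names (`unique_word_anchors`, `align`, `word_accuracy`) and the sample sentences are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from difflib import SequenceMatcher


def longest_increasing(pairs):
    """Keep the longest subsequence of (i, j) pairs whose j values increase."""
    tails, tail_idx, prev = [], [], [None] * len(pairs)
    for k, (_, j) in enumerate(pairs):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tail_idx.append(k)
        else:
            tails[pos] = j
            tail_idx[pos] = k
        prev[k] = tail_idx[pos - 1] if pos > 0 else None
    chain, k = [], (tail_idx[-1] if tail_idx else None)
    while k is not None:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]


def unique_word_anchors(truth, ocr):
    """Anchor on words that occur exactly once in each sequence (an assumption)."""
    truth_counts, ocr_counts = Counter(truth), Counter(ocr)
    ocr_pos = {w: j for j, w in enumerate(ocr) if ocr_counts[w] == 1}
    pairs = [(i, ocr_pos[w]) for i, w in enumerate(truth)
             if truth_counts[w] == 1 and w in ocr_pos]
    return longest_increasing(pairs)


def align_segment(a, b):
    """Align two short word lists; None marks a word missing on one side."""
    pairs = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n = max(i2 - i1, j2 - j1)
        left = a[i1:i2] + [None] * (n - (i2 - i1))
        right = b[j1:j2] + [None] * (n - (j2 - j1))
        pairs.extend(zip(left, right))
    return pairs


def align(truth, ocr):
    """Align whole sequences by aligning the segments between anchors."""
    anchors = unique_word_anchors(truth, ocr) + [(len(truth), len(ocr))]
    aligned, ti, oj = [], 0, 0
    for i, j in anchors:
        aligned.extend(align_segment(truth[ti:i], ocr[oj:j]))
        if i < len(truth):  # the final sentinel anchor carries no word
            aligned.append((truth[i], ocr[j]))
        ti, oj = i + 1, j + 1
    return aligned


def word_accuracy(aligned):
    """Fraction of ground-truth words reproduced exactly by the OCR output."""
    total = sum(1 for t, _ in aligned if t is not None)
    correct = sum(1 for t, o in aligned if t is not None and t == o)
    return correct / total if total else 0.0


truth = "the quick brown fox jumps over the lazy dog".split()
ocr = "the qu1ck brown fox jumps ovcr the lazy dog".split()
aligned = align(truth, ocr)
print(aligned)
print(f"word accuracy: {word_accuracy(aligned):.3f}")
```

Because each segment between two anchors is short, the quadratic cost of the inner aligner applies only to small inputs, and a chunk missing from one text simply shows up as a segment that is empty on one side.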

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

pA  
A08 01  1  ENG  @1 A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books
A09 01  1  ENG  @1 6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006 : opening information horizons : June 11-15, 2006, Chapel Hill NC
A11 01  1    @1 SHAOLEI FENG
A11 02  1    @1 MANMATHA (R.)
A14 01      @1 Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts @2 Amherst @3 USA @Z 1 aut. @Z 2 aut.
A18 01  1    @1 Association for Computing Machinery. Special Interest Group on Information Retrieval @3 USA @9 org-cong.
A18 02  1    @1 Association for Computing Machinery. Special Interest Group on Hypertext, Hypermedia and Web @3 USA @9 org-cong.
A18 03  1    @1 IEEE Computer Society. Technical Committee on Digital Libraries @3 USA @9 org-cong.
A20       @1 109-118
A21       @1 2006
A23 01      @0 ENG
A25 01      @1 ACM Press @2 New York NY
A26 01      @0 1-59593-354-9
A30 01  1  ENG  @1 ACM/IEEE Joint Conference on Digital Libraries @2 6 @3 Chapel Hill NC USA @4 2006
A43 01      @1 INIST @2 Y 38968 @5 354000153512330170
A44       @0 0000 @1 © 2008 INIST-CNRS. All rights reserved.
A45       @0 21 ref.
A47 01  1    @0 08-0091673
A60       @1 C
A61       @0 A
A66 01      @0 USA
C01 01    ENG  @0 A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.
C02 01  X    @0 790B05 @1 II
C03 01  X  FRE  @0 Traitement automatique @5 04
C03 01  X  ENG  @0 Automatic processing @5 04
C03 01  X  SPA  @0 Tratamiento automático @5 04
C03 02  X  FRE  @0 Evaluation performance @5 05
C03 02  X  ENG  @0 Performance evaluation @5 05
C03 02  X  SPA  @0 Evaluación prestación @5 05
C03 03  X  FRE  @0 Reconnaissance optique caractère @5 06
C03 03  X  ENG  @0 Optical character recognition @5 06
C03 03  X  SPA  @0 Reconocimento óptico de caracteres @5 06
C03 04  X  FRE  @0 Etude utilisation @5 07
C03 04  X  ENG  @0 Use study @5 07
C03 04  X  SPA  @0 Estudio utilización @5 07
C03 05  X  FRE  @0 Bibliothèque électronique @5 08
C03 05  X  ENG  @0 Electronic library @5 08
C03 05  X  SPA  @0 Biblioteca electronica @5 08
C03 06  X  FRE  @0 Résultat @5 09
C03 06  X  ENG  @0 Result @5 09
C03 06  X  SPA  @0 Resultado @5 09
N21       @1 052
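
The field lines above follow a regular layout that is easy to split programmatically. The Python sketch below is a minimal illustration, not a parser for the Inist Standard format as officially specified: it assumes, based only on the record shown here, that each line starts with a field tag, optionally followed by whitespace-separated indicators (occurrence number, a second indicator, a language code), and that subfields are introduced by `@` plus a one-character code. The function name `parse_field` and the dictionary keys are hypothetical.

```python
import re


def parse_field(line):
    """Split one field line into its tag, indicators and (code, value) subfields."""
    # Splitting on "@<code> " with a capturing group yields:
    # [header, code1, value1, code2, value2, ...]
    parts = re.split(r"\s*@(\w)\s+", line.strip())
    tokens = parts[0].split()
    # Subfield codes may repeat (e.g. two "@Z" entries), so keep a list of pairs.
    subfields = list(zip(parts[1::2], parts[2::2]))
    return {"tag": tokens[0], "indicators": tokens[1:], "subfields": subfields}


field = parse_field("A30 01  1  ENG  @1 ACM/IEEE Joint Conference on Digital "
                    "Libraries @2 6 @3 Chapel Hill NC USA @4 2006")
print(field["tag"], field["indicators"])
for code, value in field["subfields"]:
    print(code, value)
```

Indicators are returned in the order they appear; when an intermediate indicator is blank in the source line, its position cannot be recovered from whitespace alone, so the XML form of the record remains the more reliable source for attribute names.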

Inist format (server)

NO : FRANCIS 08-0091673 INIST
ET : A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books
AU : SHAOLEI FENG; MANMATHA (R.)
AF : Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts/Amherst/Etats-Unis (1 aut., 2 aut.)
DT : Congrès; Niveau analytique
SO : ACM/IEEE Joint Conference on Digital Libraries/6/2006/Chapel Hill NC USA; Etats-Unis; New York NY: ACM Press; Da. 2006; Pp. 109-118; ISBN 1-59593-354-9
LA : Anglais
EA : A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.
CC : 790B05
FD : Traitement automatique; Evaluation performance; Reconnaissance optique caractère; Etude utilisation; Bibliothèque électronique; Résultat
ED : Automatic processing; Performance evaluation; Optical character recognition; Use study; Electronic library; Result
SD : Tratamiento automático; Evaluación prestación; Reconocimento óptico de caracteres; Estudio utilización; Biblioteca electronica; Resultado
LO : INIST-Y 38968.354000153512330170
ID : 08-0091673

Links to Exploration step

Francis:08-0091673

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</title>
<author>
<name sortKey="Shaolei Feng" sort="Shaolei Feng" uniqKey="Shaolei Feng" last="Shaolei Feng">SHAOLEI FENG</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">08-0091673</idno>
<date when="2006">2006</date>
<idno type="stanalyst">FRANCIS 08-0091673 INIST</idno>
<idno type="RBID">Francis:08-0091673</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000307</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</title>
<author>
<name sortKey="Shaolei Feng" sort="Shaolei Feng" uniqKey="Shaolei Feng" last="Shaolei Feng">SHAOLEI FENG</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
<affiliation>
<inist:fA14 i1="01">
<s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic processing</term>
<term>Electronic library</term>
<term>Optical character recognition</term>
<term>Performance evaluation</term>
<term>Result</term>
<term>Use study</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Traitement automatique</term>
<term>Evaluation performance</term>
<term>Reconnaissance optique caractère</term>
<term>Etude utilisation</term>
<term>Bibliothèque électronique</term>
<term>Résultat</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA08 i1="01" i2="1" l="ENG">
<s1>A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006 : opening information horizons : June 11-15, 2006, Chapel Hill NC</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>SHAOLEI FENG</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>MANMATHA (R.)</s1>
</fA11>
<fA14 i1="01">
<s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1">
<s1>Association for Computing Machinery. Special Interest Group on Information Retrieval</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="02" i2="1">
<s1>Association for Computing Machinery. Special Interest Group on Hypertext, Hypermedia and Web</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="03" i2="1">
<s1>IEEE Computer Society. Technical Committee on Digital Libraries</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20>
<s1>109-118</s1>
</fA20>
<fA21>
<s1>2006</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA25 i1="01">
<s1>ACM Press</s1>
<s2>New York NY</s2>
</fA25>
<fA26 i1="01">
<s0>1-59593-354-9</s0>
</fA26>
<fA30 i1="01" i2="1" l="ENG">
<s1>ACM/IEEE Joint Conference on Digital Libraries</s1>
<s2>6</s2>
<s3>Chapel Hill NC USA</s3>
<s4>2006</s4>
</fA30>
<fA43 i1="01">
<s1>INIST</s1>
<s2>Y 38968</s2>
<s5>354000153512330170</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2008 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>21 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>08-0091673</s0>
</fA47>
<fA60>
<s1>C</s1>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA66 i1="01">
<s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>790B05</s0>
<s1>II</s1>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Traitement automatique</s0>
<s5>04</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Automatic processing</s0>
<s5>04</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Tratamiento automático</s0>
<s5>04</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Evaluation performance</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Performance evaluation</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Evaluación prestación</s0>
<s5>05</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Reconocimento óptico de caracteres</s0>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Etude utilisation</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Use study</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Estudio utilización</s0>
<s5>07</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Bibliothèque électronique</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Electronic library</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Biblioteca electronica</s0>
<s5>08</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE">
<s0>Résultat</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG">
<s0>Result</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA">
<s0>Resultado</s0>
<s5>09</s5>
</fC03>
<fN21>
<s1>052</s1>
</fN21>
</pA>
</standard>
<server>
<NO>FRANCIS 08-0091673 INIST</NO>
<ET>A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</ET>
<AU>SHAOLEI FENG; MANMATHA (R.)</AU>
<AF>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts/Amherst/Etats-Unis (1 aut., 2 aut.)</AF>
<DT>Congrès; Niveau analytique</DT>
<SO>ACM/IEEE Joint Conference on Digital Libraries/6/2006/Chapel Hill NC USA; Etats-Unis; New York NY: ACM Press; Da. 2006; Pp. 109-118; ISBN 1-59593-354-9</SO>
<LA>Anglais</LA>
<EA>A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.</EA>
<CC>790B05</CC>
<FD>Traitement automatique; Evaluation performance; Reconnaissance optique caractère; Etude utilisation; Bibliothèque électronique; Résultat</FD>
<ED>Automatic processing; Performance evaluation; Optical character recognition; Use study; Electronic library; Result</ED>
<SD>Tratamiento automático; Evaluación prestación; Reconocimento óptico de caracteres; Estudio utilización; Biblioteca electronica; Resultado</SD>
<LO>INIST-Y 38968.354000153512330170</LO>
<ID>08-0091673</ID>
</server>
</inist>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000307 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000307 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Francis:08-0091673
   |texte=   A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024