A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books
Internal identifier: 000307 (PascalFrancis/Corpus); previous: 000306; next: 000308
Authors: SHAOLEI FENG; R. Manmatha
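French descriptors
- Pascal (Inist): Traitement automatique; Evaluation performance; Reconnaissance optique caractère; Etude utilisation; Bibliothèque électronique; Résultat
English descriptors
- KwdEn: Automatic processing; Electronic library; Optical character recognition; Performance evaluation; Result; Use study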
Abstract
A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g., ASCII) text using an optical character recognition (OCR) engine and then doing full-text search on the results. Many of these books are old, and a variety of processing steps are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, and hence a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM)-based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works very well even if the OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.
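The divide-and-conquer idea in the abstract, splitting one long alignment into many short ones, can be illustrated with a minimal sketch. This is not the authors' HMM-based algorithm: it swaps in a much simpler technique, using words that occur exactly once in both texts as anchors and Python's standard `difflib.SequenceMatcher` for the short segments between anchors. The function names `unique_anchors` and `align`, and the sample sentences, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def unique_anchors(truth, ocr):
    """Pair words that occur exactly once in both sequences (assumed anchor rule).

    A simple greedy filter keeps only pairs whose OCR position increases,
    so the returned anchors never cross each other.
    """
    truth_counts, ocr_counts = Counter(truth), Counter(ocr)
    pos_in_ocr = {w: j for j, w in enumerate(ocr) if ocr_counts[w] == 1}
    candidates = [
        (i, pos_in_ocr[w])
        for i, w in enumerate(truth)
        if truth_counts[w] == 1 and w in pos_in_ocr
    ]
    anchors, last_j = [], -1
    for i, j in candidates:
        if j > last_j:
            anchors.append((i, j))
            last_j = j
    return anchors


def align(truth, ocr):
    """Return (truth_index, ocr_index) pairs of exactly matching words."""
    pairs = []
    prev_i, prev_j = 0, 0
    # A sentinel anchor at the end makes the loop also cover the final segment.
    for i, j in unique_anchors(truth, ocr) + [(len(truth), len(ocr))]:
        seg_truth, seg_ocr = truth[prev_i:i], ocr[prev_j:j]
        # Align only the short segment between two consecutive anchors.
        matcher = SequenceMatcher(None, seg_truth, seg_ocr, autojunk=False)
        for a, b, size in matcher.get_matching_blocks():
            pairs.extend((prev_i + a + k, prev_j + b + k) for k in range(size))
        if i < len(truth):  # real anchor, not the sentinel
            pairs.append((i, j))
        prev_i, prev_j = i + 1, j + 1
    return pairs


truth_words = "the quick brown fox jumps over the lazy dog".split()
ocr_words = "the qu1ck brown fox jumps ovr the lazy dog".split()
matched = align(truth_words, ocr_words)
# Fraction of ground-truth words recovered exactly by the OCR text.
word_accuracy = len(matched) / len(truth_words)
```

In this sketch, a misrecognized word such as `qu1ck` simply stays unmatched, and a block of words missing from one side yields an empty segment between two anchors rather than derailing the rest of the alignment.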
Record in standard format (ISO 2709)
See the documentation on the Inist Standard format.
Inist format (server)
NO : | FRANCIS 08-0091673 INIST |
---|---|
ET : | A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books |
AU : | SHAOLEI FENG; MANMATHA (R.) |
AF : | Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts/Amherst/United States (1 aut., 2 aut.) |
DT : | Conference; Analytic level |
SO : | ACM/IEEE Joint Conference on Digital Libraries/6/2006/Chapel Hill NC USA; United States; New York NY: ACM Press; Da. 2006; Pp. 109-118; ISBN 1-59593-354-9 |
LA : | English |
EA : | A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results. |
CC : | 790B05 |
FD : | Traitement automatique; Evaluation performance; Reconnaissance optique caractère; Etude utilisation; Bibliothèque électronique; Résultat |
ED : | Automatic processing; Performance evaluation; Optical character recognition; Use study; Electronic library; Result |
SD : | Tratamiento automático; Evaluación prestación; Reconocimento óptico de caracteres; Estudio utilización; Biblioteca electronica; Resultado |
LO : | INIST-Y 38968.354000153512330170 |
ID : | 08-0091673 |
Links to Exploration step
Francis:08-0091673
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</title>
<author><name sortKey="Shaolei Feng" sort="Shaolei Feng" uniqKey="Shaolei Feng" last="Shaolei Feng">SHAOLEI FENG</name>
<affiliation><inist:fA14 i1="01"><s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
<affiliation><inist:fA14 i1="01"><s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">08-0091673</idno>
<date when="2006">2006</date>
<idno type="stanalyst">FRANCIS 08-0091673 INIST</idno>
<idno type="RBID">Francis:08-0091673</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000307</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</title>
<author><name sortKey="Shaolei Feng" sort="Shaolei Feng" uniqKey="Shaolei Feng" last="Shaolei Feng">SHAOLEI FENG</name>
<affiliation><inist:fA14 i1="01"><s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author><name sortKey="Manmatha, R" sort="Manmatha, R" uniqKey="Manmatha R" first="R." last="Manmatha">R. Manmatha</name>
<affiliation><inist:fA14 i1="01"><s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic processing</term>
<term>Electronic library</term>
<term>Optical character recognition</term>
<term>Performance evaluation</term>
<term>Result</term>
<term>Use study</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Traitement automatique</term>
<term>Evaluation performance</term>
<term>Reconnaissance optique caractère</term>
<term>Etude utilisation</term>
<term>Bibliothèque électronique</term>
<term>Résultat</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA08 i1="01" i2="1" l="ENG"><s1>A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>6th ACM/IEEE-CS Joint Conference on Digital Libraries 2006 : opening information horizons : June 11-15, 2006, Chapel Hill NC</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>SHAOLEI FENG</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>MANMATHA (R.)</s1>
</fA11>
<fA14 i1="01"><s1>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts</s1>
<s2>Amherst</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</fA14>
<fA18 i1="01" i2="1"><s1>Association for Computing Machinery. Special Interest Group on Information Retrieval</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="02" i2="1"><s1>Association for Computing Machinery. Special Interest Group on Hypertext, Hypermedia and Web</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA18 i1="03" i2="1"><s1>IEEE Computer Society. Technical Committee on Digital Libraries</s1>
<s3>USA</s3>
<s9>org-cong.</s9>
</fA18>
<fA20><s1>109-118</s1>
</fA20>
<fA21><s1>2006</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA25 i1="01"><s1>ACM Press</s1>
<s2>New York NY</s2>
</fA25>
<fA26 i1="01"><s0>1-59593-354-9</s0>
</fA26>
<fA30 i1="01" i2="1" l="ENG"><s1>ACM/IEEE Joint Conference on Digital Libraries</s1>
<s2>6</s2>
<s3>Chapel Hill NC USA</s3>
<s4>2006</s4>
</fA30>
<fA43 i1="01"><s1>INIST</s1>
<s2>Y 38968</s2>
<s5>354000153512330170</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2008 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>21 ref.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>08-0091673</s0>
</fA47>
<fA60><s1>C</s1>
</fA60>
<fA61><s0>A</s0>
</fA61>
<fA66 i1="01"><s0>USA</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.</s0>
</fC01>
<fC02 i1="01" i2="X"><s0>790B05</s0>
<s1>II</s1>
</fC02>
<fC03 i1="01" i2="X" l="FRE"><s0>Traitement automatique</s0>
<s5>04</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG"><s0>Automatic processing</s0>
<s5>04</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA"><s0>Tratamiento automático</s0>
<s5>04</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE"><s0>Evaluation performance</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG"><s0>Performance evaluation</s0>
<s5>05</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA"><s0>Evaluación prestación</s0>
<s5>05</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>06</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0>
<s5>06</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Etude utilisation</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Use study</s0>
<s5>07</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Estudio utilización</s0>
<s5>07</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Bibliothèque électronique</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Electronic library</s0>
<s5>08</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Biblioteca electronica</s0>
<s5>08</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE"><s0>Résultat</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG"><s0>Result</s0>
<s5>09</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA"><s0>Resultado</s0>
<s5>09</s5>
</fC03>
<fN21><s1>052</s1>
</fN21>
</pA>
</standard>
<server><NO>FRANCIS 08-0091673 INIST</NO>
<ET>A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books</ET>
<AU>SHAOLEI FENG; MANMATHA (R.)</AU>
<AF>Multimedia Indexing and Retrieval Group Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts/Amherst/Etats-Unis (1 aut., 2 aut.)</AF>
<DT>Congrès; Niveau analytique</DT>
<SO>ACM/IEEE Joint Conference on Digital Libraries/6/2006/Chapel Hill NC USA; Etats-Unis; New York NY: ACM Press; Da. 2006; Pp. 109-118; ISBN 1-59593-354-9</SO>
<LA>Anglais</LA>
<EA>A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based on line book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end to end system. Changing any step (including the scanning process) can affect OCR performance and hence a good automatic statistical evaluation of OCR performance on book length material is needed. Evaluating OCR performance on the entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large (easily a million long) sequences. The problem is further complicated by OCR errors as well as the possibility of large chunks of missing material in one of the sequences. We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking up the problem of aligning two long sequences into the problem of aligning many smaller subsequences. This can be rapidly and effectively done. Experimental results show that our hierarchical alignment approach works very well even if OCR output has a high recognition error rate. Finally, we evaluate the performance of a commercial OCR engine over a large dataset of books based on the alignment results.</EA>
<CC>790B05</CC>
<FD>Traitement automatique; Evaluation performance; Reconnaissance optique caractère; Etude utilisation; Bibliothèque électronique; Résultat</FD>
<ED>Automatic processing; Performance evaluation; Optical character recognition; Use study; Electronic library; Result</ED>
<SD>Tratamiento automático; Evaluación prestación; Reconocimento óptico de caracteres; Estudio utilización; Biblioteca electronica; Resultado</SD>
<LO>INIST-Y 38968.354000153512330170</LO>
<ID>08-0091673</ID>
</server>
</inist>
</record>
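As a rough illustration of how the keyword lists in the record above might be read programmatically, the following sketch parses a copy of the `<textClass>` fragment with Python's standard `xml.etree.ElementTree`. The helper name `terms_by_scheme` is hypothetical. Only the fragment is parsed because the full record uses the `inist:` prefix without declaring a namespace in this excerpt, which a namespace-aware parser would likely reject unless a declaration is added.

```python
import xml.etree.ElementTree as ET

# Copy of the <textClass> element from the record above.
FRAGMENT = """
<textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic processing</term>
<term>Electronic library</term>
<term>Optical character recognition</term>
<term>Performance evaluation</term>
<term>Result</term>
<term>Use study</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Traitement automatique</term>
<term>Evaluation performance</term>
<term>Reconnaissance optique caractère</term>
<term>Etude utilisation</term>
<term>Bibliothèque électronique</term>
<term>Résultat</term>
</keywords>
</textClass>
"""


def terms_by_scheme(xml_text):
    """Map each <keywords> scheme attribute to its list of <term> texts."""
    root = ET.fromstring(xml_text.strip())
    return {
        keywords.get("scheme"): [term.text for term in keywords.findall("term")]
        for keywords in root.iter("keywords")
    }


descriptors = terms_by_scheme(FRAGMENT)
for scheme, terms in descriptors.items():
    print(scheme, "->", "; ".join(terms))
```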
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000307 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000307 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= PascalFrancis |étape= Corpus |type= RBID |clé= Francis:08-0091673 |texte= A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books }}
This area was generated with Dilib version V0.6.32.