Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis
Internal identifier: 000752 (PascalFrancis/Curation); previous: 000751; next: 000753
Authors: Jean-Yves Ramel [France]; Nicolas Sidere [France]; Frédéric Rayar [France]
Source: Literary and linguistic computing [0268-1145]; 2013.
RBID : Francis:14-0182616
Abstract
This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
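The labelling shortcut described in the abstract can be illustrated with a minimal sketch. This is not the AGORA or RETRO implementation: the function names, the toy cluster identifiers, and the definition of the redundancy rate as one minus the ratio of clusters to letterform occurrences are all assumptions made for illustration.

```python
def redundancy_rate(cluster_ids):
    """Assumed definition: share of letterform occurrences that fall into
    a cluster already represented by another occurrence."""
    if not cluster_ids:
        return 0.0
    return 1.0 - len(set(cluster_ids)) / len(cluster_ids)


def transcribe(cluster_ids, cluster_labels, unknown="?"):
    """Propagate the single label given to each cluster to all its occurrences."""
    return "".join(cluster_labels.get(cid, unknown) for cid in cluster_ids)


# Hypothetical toy data: one cluster id per extracted letterform, in reading order.
occurrences = [0, 1, 2, 2, 3, 0, 1, 2, 2, 3]

# The user labels each cluster once (4 labels here instead of 10 characters).
labels = {0: "h", 1: "e", 2: "l", 3: "o"}

print(redundancy_rate(occurrences))
print(transcribe(occurrences, labels))  # prints "hellohello"
```

Under this assumed definition, a book in which ten occurrences share one cluster has a rate of 0.9, which matches the abstract's example of labelling only one character out of ten.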
pA

| Field | i1 | i2 | l | Subfields |
|---|---|---|---|---|
| A01 | 01 | 1 | | @0 0268-1145 |
| A03 | | 1 | | @0 Lit. linguist. comput. |
| A05 | | | | @2 28 |
| A06 | | | | @2 2 |
| A08 | 01 | 1 | ENG | @1 Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis |
| A09 | 01 | 1 | ENG | @1 Digital Humanities 2011: Big Tent Digital Humanities |
| A11 | 01 | 1 | | @1 RAMEL (Jean-Yves) |
| A11 | 02 | 1 | | @1 SIDERE (Nicolas) |
| A11 | 03 | 1 | | @1 RAYAR (Frédéric) |
| A12 | 01 | 1 | | @1 WALTER (Katherine) @9 ed. |
| A12 | 02 | 1 | | @1 JOCKERS (Matt) @9 ed. |
| A12 | 03 | 1 | | @1 WORTHEY (Glen) @9 ed. |
| A14 | 01 | | | @1 Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours @3 FRA @Z 1 aut. @Z 2 aut. @Z 3 aut. |
| A15 | 01 | | | @1 Center for Digital Research in the Humanities, University of Nebraska-Lincoln @3 INC @Z 1 aut. |
| A20 | | | | @1 301-314 |
| A21 | | | | @1 2013 |
| A23 | 01 | | | @0 ENG |
| A43 | 01 | | | @1 INIST @2 23967 @5 354000503009700130 |
| A44 | | | | @0 0000 @1 © 2014 INIST-CNRS. All rights reserved. |
| A45 | | | | @0 3/4 p. |
| A47 | 01 | 1 | | @0 14-0182616 |
| A60 | | | | @1 P @2 C |
| A61 | | | | @0 A |
| A64 | 01 | 1 | | @0 Literary and linguistic computing |
| A66 | 01 | | | @0 GBR |
| C01 | 01 | | ENG | @0 This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows. |
| C02 | 01 | L | | @0 52478 @1 XV |
| C02 | 02 | L | | @0 524 |
| C03 | 01 | L | FRE | @0 Linguistique informatique @2 NI @5 01 |
| C03 | 01 | L | ENG | @0 Computational linguistics @2 NI @5 01 |
| C03 | 02 | L | FRE | @0 Extraction @2 NI @5 02 |
| C03 | 02 | L | ENG | @0 Extraction @2 NI @5 02 |
| C03 | 03 | L | FRE | @0 Texte @2 NI @5 03 |
| C03 | 03 | L | ENG | @0 Text @2 NI @5 03 |
| C03 | 04 | X | FRE | @0 Représentation graphique @5 04 |
| C03 | 04 | X | ENG | @0 Graphics @5 04 |
| C03 | 04 | X | SPA | @0 Grafo (curva) @5 04 |
| C03 | 05 | X | FRE | @0 Reconnaissance optique caractère @5 05 |
| C03 | 05 | X | ENG | @0 Optical character recognition @5 05 |
| C03 | 05 | X | SPA | @0 Reconocimento óptico de caracteres @5 05 |
| C03 | 06 | L | FRE | @0 Reconnaissance automatique @2 NI @5 06 |
| C03 | 06 | L | ENG | @0 Automatic recognition @2 NI @5 06 |
| C03 | 07 | X | FRE | @0 Bibliothèque électronique @5 07 |
| C03 | 07 | X | ENG | @0 Electronic library @5 07 |
| C03 | 07 | X | SPA | @0 Biblioteca electronica @5 07 |
| C03 | 08 | X | FRE | @0 Archivage électronique @5 08 |
| C03 | 08 | X | ENG | @0 Electronic storage @5 08 |
| C03 | 08 | X | SPA | @0 Archivo electrónico @5 08 |
| C03 | 09 | L | FRE | @0 Humanités numériques @4 INC @5 31 |
| N21 | | | | @1 230 |

pR

| Field | i1 | i2 | l | Subfields |
|---|---|---|---|---|
| A30 | 01 | 1 | ENG | @1 Digital Humanities conference @3 USA @4 2011-06-19 |
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: To go to this record in the Curation step: 000035
Links to Exploration step
Francis:14-0182616
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author><name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">14-0182616</idno>
<date when="2013">2013</date>
<idno type="stanalyst">FRANCIS 14-0182616 INIST</idno>
<idno type="RBID">Francis:14-0182616</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000035</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000752</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author><name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author><name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Literary and linguistic computing</title>
<title level="j" type="abbreviated">Lit. linguist. comput.</title>
<idno type="ISSN">0268-1145</idno>
<imprint><date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Literary and linguistic computing</title>
<title level="j" type="abbreviated">Lit. linguist. comput.</title>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic recognition</term>
<term>Computational linguistics</term>
<term>Electronic library</term>
<term>Electronic storage</term>
<term>Extraction</term>
<term>Graphics</term>
<term>Optical character recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Linguistique informatique</term>
<term>Extraction</term>
<term>Texte</term>
<term>Représentation graphique</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance automatique</term>
<term>Bibliothèque électronique</term>
<term>Archivage électronique</term>
<term>Humanités numériques</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.</div>
</front>
</TEI>
<inist><standard h6="B"><pA><fA01 i1="01" i2="1"><s0>0268-1145</s0>
</fA01>
<fA03 i2="1"><s0>Lit. linguist. comput.</s0>
</fA03>
<fA08 i1="01" i2="1" l="ENG"><s1>Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG"><s1>Digital Humanities 2011: Big Tent Digital Humanities</s1>
</fA09>
<fA11 i1="01" i2="1"><s1>RAMEL (Jean-Yves)</s1>
</fA11>
<fA11 i1="02" i2="1"><s1>SIDERE (Nicolas)</s1>
</fA11>
<fA11 i1="03" i2="1"><s1>RAYAR (Frédéric)</s1>
</fA11>
<fA12 i1="01" i2="1"><s1>WALTER (Katherine)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1"><s1>JOCKERS (Matt)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="03" i2="1"><s1>WORTHEY (Glen)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01"><s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA15 i1="01"><s1>Center for Digital Research in the Humanities, University of Nebraska-Lincoln</s1>
<s3>INC</s3>
<sZ>1 aut.</sZ>
</fA15>
<fA20><s1>301-314</s1>
</fA20>
<fA21><s1>2013</s1>
</fA21>
<fA23 i1="01"><s0>ENG</s0>
</fA23>
<fA43 i1="01"><s1>INIST</s1>
<s2>23967</s2>
<s5>354000503009700130</s5>
</fA43>
<fA44><s0>0000</s0>
<s1>© 2014 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45><s0>3/4 p.</s0>
</fA45>
<fA47 i1="01" i2="1"><s0>14-0182616</s0>
</fA47>
<fA60><s1>P</s1>
<s2>C</s2>
</fA60>
<fA64 i1="01" i2="1"><s0>Literary and linguistic computing</s0>
</fA64>
<fA66 i1="01"><s0>GBR</s0>
</fA66>
<fC01 i1="01" l="ENG"><s0>This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.</s0>
</fC01>
<fC02 i1="01" i2="L"><s0>52478</s0>
<s1>XV</s1>
</fC02>
<fC02 i1="02" i2="L"><s0>524</s0>
</fC02>
<fC03 i1="01" i2="L" l="FRE"><s0>Linguistique informatique</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="L" l="ENG"><s0>Computational linguistics</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="L" l="FRE"><s0>Extraction</s0>
<s2>NI</s2>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="L" l="ENG"><s0>Extraction</s0>
<s2>NI</s2>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="L" l="FRE"><s0>Texte</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="L" l="ENG"><s0>Text</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE"><s0>Représentation graphique</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG"><s0>Graphics</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA"><s0>Grafo (curva)</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE"><s0>Reconnaissance optique caractère</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Optical character recognition</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="L" l="FRE"><s0>Reconnaissance automatique</s0>
<s2>NI</s2>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="L" l="ENG"><s0>Automatic recognition</s0>
<s2>NI</s2>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE"><s0>Bibliothèque électronique</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG"><s0>Electronic library</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA"><s0>Biblioteca electronica</s0>
<s5>07</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE"><s0>Archivage électronique</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG"><s0>Electronic storage</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA"><s0>Archivo electrónico</s0>
<s5>08</s5>
</fC03>
<fC03 i1="09" i2="L" l="FRE"><s0>Humanités numériques</s0>
<s4>INC</s4>
<s5>31</s5>
</fC03>
<fN21><s1>230</s1>
</fN21>
</pA>
<pR><fA30 i1="01" i2="1" l="ENG"><s1>Digital Humanities conference</s1>
<s3>USA</s3>
<s4>2011-06-19</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
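As a hedged illustration of how the INIST part of this record could be read programmatically, the sketch below parses a short excerpt copied from the `<pA>` block above and lists the descriptors for one language. The helper name `descriptors` and the choice of Python's standard `xml.etree.ElementTree` module are assumptions for illustration, not part of Dilib. Note that the TEI part of the record, as displayed here, uses the `wicri:` and `inist:` prefixes without visible namespace declarations, so a strict XML parser would probably reject the full record unless those declarations are supplied.

```python
import xml.etree.ElementTree as ET

# Excerpt copied from the <inist> part of the record (four fC03 fields only).
EXCERPT = """
<pA>
<fC03 i1="01" i2="L" l="FRE"><s0>Linguistique informatique</s0><s2>NI</s2><s5>01</s5></fC03>
<fC03 i1="01" i2="L" l="ENG"><s0>Computational linguistics</s0><s2>NI</s2><s5>01</s5></fC03>
<fC03 i1="05" i2="X" l="ENG"><s0>Optical character recognition</s0><s5>05</s5></fC03>
<fC03 i1="05" i2="X" l="SPA"><s0>Reconocimento óptico de caracteres</s0><s5>05</s5></fC03>
</pA>
"""


def descriptors(xml_text, language):
    """Return the <s0> text of every fC03 field whose l attribute matches."""
    root = ET.fromstring(xml_text)
    return [
        field.findtext("s0")
        for field in root.iter("fC03")
        if field.get("l") == language
    ]


english_descriptors = descriptors(EXCERPT, "ENG")
print(english_descriptors)
```

The same function applied to the whole `<pA>` element would return all eight English descriptors listed in the record.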
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000752 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000752 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien
|wiki= Ticri/CIDE
|area= OcrV1
|flux= PascalFrancis
|étape= Curation
|type= RBID
|clé= Francis:14-0182616
|texte= Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis
}}
This area was generated with Dilib version V0.6.32. Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024