OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis

Internal identifier: 000752 (PascalFrancis/Curation); previous: 000751; next: 000753

Authors: Jean-Yves Ramel [France]; Nicolas Sidere [France]; Frédéric Rayar [France]

Source: Literary and linguistic computing (ISSN 0268-1145), 2013, vol. 28, no. 2, pp. 301-314

RBID : Francis:14-0182616

French descriptors

English descriptors

Abstract

This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work lies in the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that the different letterforms of a book can be extracted and analysed to compute redundancy rates. This process significantly reduces the number of letterforms to be recognized. Once the letterforms have been clustered, a user may assign a label to each cluster using the second package, RETRO. Labels are then automatically assigned to each corresponding character to transcribe the text of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical font recognition methods to improve the recognition results of OCR systems on rare or specific books.
The identification of typographic materials could also be useful for the study of both the aesthetic aspects (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and the economic aspects of the history of printing. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
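
The labelling arithmetic in the abstract can be illustrated with a minimal Python sketch. It is not the AGORA or RETRO implementation: the function names, the cluster identifiers, and the definition of the redundancy rate as one minus the ratio of clusters to letterform occurrences are all assumptions made for illustration. Under that definition, a rate of 90% means the number of clusters to label is one tenth of the number of occurrences, which matches the "one character out of ten" figure.

```python
def redundancy_rate(cluster_ids):
    """Share of letterform occurrences that fall in an already-represented cluster.

    Assumed definition: 1 - (number of clusters / number of occurrences).
    """
    total = len(cluster_ids)
    if total == 0:
        return 0.0
    return 1.0 - len(set(cluster_ids)) / total


def transcribe(cluster_ids, cluster_labels, unknown="?"):
    """Propagate one user-supplied label per cluster to every occurrence."""
    return "".join(cluster_labels.get(c, unknown) for c in cluster_ids)


# Hypothetical clustering output: one cluster id per letterform, in reading order.
occurrences = [0, 1, 2, 1, 2, 1]

# The user labels each cluster once (3 labels for 6 letterforms).
labels = {0: "b", 1: "a", 2: "n"}

print(redundancy_rate(occurrences))      # 0.5
print(transcribe(occurrences, labels))   # banana
```

Because labels are attached to clusters rather than to a fixed alphabet, a special character such as the long s could be handled simply by giving its cluster the label "ſ".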
pA  
A01 01  1    @0 0268-1145
A03   1    @0 Lit. linguist. comput.
A05       @2 28
A06       @2 2
A08 01  1  ENG  @1 Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis
A09 01  1  ENG  @1 Digital Humanities 2011: Big Tent Digital Humanities
A11 01  1    @1 RAMEL (Jean-Yves)
A11 02  1    @1 SIDERE (Nicolas)
A11 03  1    @1 RAYAR (Frédéric)
A12 01  1    @1 WALTER (Katherine) @9 ed.
A12 02  1    @1 JOCKERS (Matt) @9 ed.
A12 03  1    @1 WORTHEY (Glen) @9 ed.
A14 01      @1 Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours @3 FRA @Z 1 aut. @Z 2 aut. @Z 3 aut.
A15 01      @1 Center for Digital Research in the Humanities, University of Nebraska-Lincoln @3 INC @Z 1 aut.
A20       @1 301-314
A21       @1 2013
A23 01      @0 ENG
A43 01      @1 INIST @2 23967 @5 354000503009700130
A44       @0 0000 @1 © 2014 INIST-CNRS. All rights reserved.
A45       @0 3/4 p.
A47 01  1    @0 14-0182616
A60       @1 P @2 C
A61       @0 A
A64 01  1    @0 Literary and linguistic computing
A66 01      @0 GBR
C01 01    ENG  @0 This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
C02 01  L    @0 52478 @1 XV
C02 02  L    @0 524
C03 01  L  FRE  @0 Linguistique informatique @2 NI @5 01
C03 01  L  ENG  @0 Computational linguistics @2 NI @5 01
C03 02  L  FRE  @0 Extraction @2 NI @5 02
C03 02  L  ENG  @0 Extraction @2 NI @5 02
C03 03  L  FRE  @0 Texte @2 NI @5 03
C03 03  L  ENG  @0 Text @2 NI @5 03
C03 04  X  FRE  @0 Représentation graphique @5 04
C03 04  X  ENG  @0 Graphics @5 04
C03 04  X  SPA  @0 Grafo (curva) @5 04
C03 05  X  FRE  @0 Reconnaissance optique caractère @5 05
C03 05  X  ENG  @0 Optical character recognition @5 05
C03 05  X  SPA  @0 Reconocimento óptico de caracteres @5 05
C03 06  L  FRE  @0 Reconnaissance automatique @2 NI @5 06
C03 06  L  ENG  @0 Automatic recognition @2 NI @5 06
C03 07  X  FRE  @0 Bibliothèque électronique @5 07
C03 07  X  ENG  @0 Electronic library @5 07
C03 07  X  SPA  @0 Biblioteca electronica @5 07
C03 08  X  FRE  @0 Archivage électronique @5 08
C03 08  X  ENG  @0 Electronic storage @5 08
C03 08  X  SPA  @0 Archivo electrónico @5 08
C03 09  L  FRE  @0 Humanités numériques @4 INC @5 31
N21       @1 230
pR  
A30 01  1  ENG  @1 Digital Humanities conference @3 USA @4 2011-06-19
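
The flat field listing above follows a regular pattern: a field tag, optional indicators, then one or more subfields introduced by `@` and a one-character code. A minimal Python sketch of a reader for such lines follows. The function name `parse_field` and the regular expressions are assumptions made for illustration, not part of Dilib, and reading `A11` as the author field is inferred only from the values shown in this record. The sketch does not handle field values that continue on a following line, as the abstract in `C01` does.

```python
import re

# A field line starts with a tag such as "A11" or "C03".
FIELD_RE = re.compile(r"^([A-Z]\d{2})\b(.*)$")
# A subfield is "@" + one-character code + whitespace + value (up to the next "@").
SUBFIELD_RE = re.compile(r"@(\w)\s+([^@]*)")


def parse_field(line):
    """Return (tag, indicators, [(code, value), ...]) or None for non-field lines."""
    match = FIELD_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    head, _, _ = rest.partition("@")          # text before the first subfield
    indicators = head.split()
    subfields = [(code, value.strip()) for code, value in SUBFIELD_RE.findall(rest)]
    return tag, indicators, subfields


# Lines copied from the record above.
sample = """\
pA
A05       @2 28
A11 01  1    @1 RAMEL (Jean-Yves)
A11 02  1    @1 SIDERE (Nicolas)
A11 03  1    @1 RAYAR (Frédéric)
A43 01      @1 INIST @2 23967 @5 354000503009700130
"""

fields = [f for f in map(parse_field, sample.splitlines()) if f is not None]
authors = [subs[0][1] for tag, _, subs in fields if tag == "A11"]
print(authors)
```

Subfields are kept as a list of pairs rather than a dictionary because a code can repeat within one field, as `@Z` does in the `A14` line.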

Links to Exploration step

Francis:14-0182616

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author>
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">14-0182616</idno>
<date when="2013">2013</date>
<idno type="stanalyst">FRANCIS 14-0182616 INIST</idno>
<idno type="RBID">Francis:14-0182616</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000035</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000752</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</title>
<author>
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidere">Nicolas Sidere</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Rayar, Frederic" sort="Rayar, Frederic" uniqKey="Rayar F" first="Frédéric" last="Rayar">Frédéric Rayar</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>France</country>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Literary and linguistic computing</title>
<title level="j" type="abbreviated">Lit. linguist. comput.</title>
<idno type="ISSN">0268-1145</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Literary and linguistic computing</title>
<title level="j" type="abbreviated">Lit. linguist. comput.</title>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic recognition</term>
<term>Computational linguistics</term>
<term>Electronic library</term>
<term>Electronic storage</term>
<term>Extraction</term>
<term>Graphics</term>
<term>Optical character recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Linguistique informatique</term>
<term>Extraction</term>
<term>Texte</term>
<term>Représentation graphique</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance automatique</term>
<term>Bibliothèque électronique</term>
<term>Archivage électronique</term>
<term>Humanités numériques</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0268-1145</s0>
</fA01>
<fA03 i2="1">
<s0>Lit. linguist. comput.</s0>
</fA03>
<fA05>
<s2>28</s2>
</fA05>
<fA06>
<s2>2</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG">
<s1>Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis</s1>
</fA08>
<fA09 i1="01" i2="1" l="ENG">
<s1>Digital Humanities 2011: Big Tent Digital Humanities</s1>
</fA09>
<fA11 i1="01" i2="1">
<s1>RAMEL (Jean-Yves)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>SIDERE (Nicolas)</s1>
</fA11>
<fA11 i1="03" i2="1">
<s1>RAYAR (Frédéric)</s1>
</fA11>
<fA12 i1="01" i2="1">
<s1>WALTER (Katherine)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="02" i2="1">
<s1>JOCKERS (Matt)</s1>
<s9>ed.</s9>
</fA12>
<fA12 i1="03" i2="1">
<s1>WORTHEY (Glen)</s1>
<s9>ed.</s9>
</fA12>
<fA14 i1="01">
<s1>Laboratoire d'Informatique de Tours (EA 6300) Ecole d'ingénieurs Polytechnique de l'Université de Tours</s1>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</fA14>
<fA15 i1="01">
<s1>Center for Digital Research in the Humanities, University of Nebraska-Lincoln</s1>
<s3>INC</s3>
<sZ>1 aut.</sZ>
</fA15>
<fA20>
<s1>301-314</s1>
</fA20>
<fA21>
<s1>2013</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>23967</s2>
<s5>354000503009700130</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2014 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>3/4 p.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>14-0182616</s0>
</fA47>
<fA60>
<s1>P</s1>
<s2>C</s2>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Literary and linguistic computing</s0>
</fA64>
<fA66 i1="01">
<s0>GBR</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.</s0>
</fC01>
<fC02 i1="01" i2="L">
<s0>52478</s0>
<s1>XV</s1>
</fC02>
<fC02 i1="02" i2="L">
<s0>524</s0>
</fC02>
<fC03 i1="01" i2="L" l="FRE">
<s0>Linguistique informatique</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="L" l="ENG">
<s0>Computational linguistics</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="L" l="FRE">
<s0>Extraction</s0>
<s2>NI</s2>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="L" l="ENG">
<s0>Extraction</s0>
<s2>NI</s2>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="L" l="FRE">
<s0>Texte</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="L" l="ENG">
<s0>Text</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Représentation graphique</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Graphics</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Grafo (curva)</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Reconnaissance optique caractère</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Optical character recognition</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Reconocimento óptico de caracteres</s0>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="L" l="FRE">
<s0>Reconnaissance automatique</s0>
<s2>NI</s2>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="L" l="ENG">
<s0>Automatic recognition</s0>
<s2>NI</s2>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE">
<s0>Bibliothèque électronique</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Electronic library</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA">
<s0>Biblioteca electronica</s0>
<s5>07</s5>
</fC03>
<fC03 i1="08" i2="X" l="FRE">
<s0>Archivage électronique</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="ENG">
<s0>Electronic storage</s0>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="X" l="SPA">
<s0>Archivo electrónico</s0>
<s5>08</s5>
</fC03>
<fC03 i1="09" i2="L" l="FRE">
<s0>Humanités numériques</s0>
<s4>INC</s4>
<s5>31</s5>
</fC03>
<fN21>
<s1>230</s1>
</fN21>
</pA>
<pR>
<fA30 i1="01" i2="1" l="ENG">
<s1>Digital Humanities conference</s1>
<s3>USA</s3>
<s4>2011-06-19</s4>
</fA30>
</pR>
</standard>
</inist>
</record>
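
As a hedged illustration of how the keyword lists in the record above might be read programmatically, the following Python sketch uses the standard library's `xml.etree.ElementTree` on an abridged copy of the `<textClass>` element. The function name `keywords_by_language` is an assumption. Note that the full record as printed uses the `wicri:` and `inist:` prefixes without declaring them, so a namespace-aware parser would reject it unchanged; the fragment below uses only `xml:lang`, whose prefix is predefined.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Abridged excerpt of the <textClass> element from the record above.
fragment = """
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automatic recognition</term>
<term>Optical character recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance automatique</term>
<term>Reconnaissance optique caractère</term>
</keywords>
</textClass>
"""


def keywords_by_language(xml_text):
    """Map each xml:lang value to the list of <term> texts found under it."""
    root = ET.fromstring(xml_text)
    result = {}
    for keywords in root.iter("keywords"):
        lang = keywords.get(XML_LANG, "und")   # "und" = undetermined language
        terms = [term.text for term in keywords.findall("term")]
        result.setdefault(lang, []).extend(terms)
    return result


print(keywords_by_language(fragment))
```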

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000752 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Curation/biblio.hfd -nk 000752 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Curation
   |type=    RBID
   |clé=     Francis:14-0182616
   |texte=   Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024