Exploration server on the visibility of Le Havre

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Adaptative quality control of digital documents in mass digitization projects

Internal identifier: 000120 (Hal/Checkpoint); previous: 000119; next: 000121

Authors: Ahmed Ben Salah [France]

RBID : Hal:tel-01164698

Abstract

This work focuses on assessing the character recognition results produced automatically by optical character recognition (OCR) software in mass digitization projects. The goal is to design a global control system robust enough to handle the BnF document collection, which includes old documents that are difficult for OCR to process. We designed a detection system that finds words missed in OCR results, and a word recognition rate estimator that assesses the quality of the word recognition performed by OCR. We created two kinds of descriptors to characterize OCR outputs: image descriptors, which characterize page segmentation results, and cross-alignment descriptors, which characterize the quality of word recognition results. Furthermore, we adapted our learning process to build adaptive decision or prediction systems. We evaluated our control systems on real images selected randomly from the BnF collection. The missed-word detection system detects 84.15% of the words omitted by the OCR with a precision of 94.73%. The experiments also showed that 80% of the documents with a word recognition rate below 98% are detected with a precision of 92%. The system can also automatically detect 45% of the documents with a recognition rate below 70% with a precision greater than 92%.

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<title xml:lang="fr">Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-399419" status="INCOMING">
<orgName>DocApp et Rfai</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-23832" type="direct"></relation>
<relation active="#struct-300317" type="indirect"></relation>
<relation name="EA4108" active="#struct-300318" type="indirect"></relation>
<relation active="#struct-301288" type="indirect"></relation>
<relation active="#struct-301232" type="indirect"></relation>
<relation active="#struct-203066" type="direct"></relation>
<relation active="#struct-302209" type="indirect"></relation>
<relation active="#struct-204893" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="indirect"></relation>
<relation active="#struct-300408" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-23832" type="direct">
<org type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300317" type="indirect">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="indirect">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="indirect">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-203066" type="direct">
<org type="laboratory" xml:id="struct-203066" status="VALID">
<orgName>Bibliothèque nationale de France, Délégation à la Stratégie et à la recherche</orgName>
<orgName type="acronym">BnF_DSG</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/la_bnf/strategie_recherche.html</ref>
</desc>
<listRelation>
<relation active="#struct-302209" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-302209" type="indirect">
<org type="institution" xml:id="struct-302209" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-204893" type="direct">
<org type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
<relation active="#struct-300408" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="indirect">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300408" type="indirect">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Le Havre</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université du Havre</orgName>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:tel-01164698</idno>
<idno type="halId">tel-01164698</idno>
<idno type="halUri">https://hal-bnf.archives-ouvertes.fr/tel-01164698</idno>
<idno type="url">https://hal-bnf.archives-ouvertes.fr/tel-01164698</idno>
<date when="2014-07-11">2014-07-11</date>
<idno type="wicri:Area/Hal/Corpus">000027</idno>
<idno type="wicri:Area/Hal/Curation">000027</idno>
<idno type="wicri:Area/Hal/Checkpoint">000120</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<title xml:lang="fr">Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse</title>
<author>
<name sortKey="Ben Salah, Ahmed" sort="Ben Salah, Ahmed" uniqKey="Ben Salah A" first="Ahmed" last="Ben Salah">Ahmed Ben Salah</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-399419" status="INCOMING">
<orgName>DocApp et Rfai</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-23832" type="direct"></relation>
<relation active="#struct-300317" type="indirect"></relation>
<relation name="EA4108" active="#struct-300318" type="indirect"></relation>
<relation active="#struct-301288" type="indirect"></relation>
<relation active="#struct-301232" type="indirect"></relation>
<relation active="#struct-203066" type="direct"></relation>
<relation active="#struct-302209" type="indirect"></relation>
<relation active="#struct-204893" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="indirect"></relation>
<relation active="#struct-300408" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-23832" type="direct">
<org type="laboratory" xml:id="struct-23832" status="VALID">
<orgName>Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</orgName>
<orgName type="acronym">LITIS</orgName>
<desc>
<address>
<addrLine>Avenue de l'Université UFR des Sciences et Techniques 76800 Saint-Etienne du Rouvray</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.litislab.eu</ref>
</desc>
<listRelation>
<relation active="#struct-300317" type="direct"></relation>
<relation name="EA4108" active="#struct-300318" type="direct"></relation>
<relation active="#struct-301288" type="direct"></relation>
<relation active="#struct-301232" type="indirect"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300317" type="indirect">
<org type="institution" xml:id="struct-300317" status="VALID">
<orgName>Université du Havre</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA4108" active="#struct-300318" type="indirect">
<org type="institution" xml:id="struct-300318" status="VALID">
<orgName>Université de Rouen</orgName>
<desc>
<address>
<addrLine> 1 rue Thomas Becket - 76821 Mont-Saint-Aignan</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-rouen.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-301288" type="indirect">
<org type="department" xml:id="struct-301288" status="VALID">
<orgName>Institut National des Sciences Appliquées - Rouen</orgName>
<orgName type="acronym">INSA Rouen</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-301232" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-301232" type="indirect">
<org type="institution" xml:id="struct-301232" status="VALID">
<orgName>Institut National des Sciences Appliquées</orgName>
<orgName type="acronym">INSA</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-203066" type="direct">
<org type="laboratory" xml:id="struct-203066" status="VALID">
<orgName>Bibliothèque nationale de France, Délégation à la Stratégie et à la recherche</orgName>
<orgName type="acronym">BnF_DSG</orgName>
<desc>
<address>
<addrLine>Quai François Mauriac, 75706 Paris cedex 13</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.bnf.fr/fr/la_bnf/strategie_recherche.html</ref>
</desc>
<listRelation>
<relation active="#struct-302209" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-302209" type="indirect">
<org type="institution" xml:id="struct-302209" status="VALID">
<orgName>Bibliothèque Nationale de France</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle active="#struct-204893" type="direct">
<org type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
<relation active="#struct-300408" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="indirect">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-300408" type="indirect">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Le Havre</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université du Havre</orgName>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
<orgName type="university">Université de Rouen</orgName>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Optical Character Recognition</term>
<term>Quality Assessment</term>
<term>Segmentation defects</term>
<term>Texture Characterization</term>
</keywords>
<keywords scheme="mix" xml:lang="fr">
<term>Analyse de texture</term>
<term>Classification</term>
<term>Erreur de segmentation</term>
<term>Prédiction de performances</term>
<term>Reconnaissance de caractères</term>
<term>Reconnaissance optique de caractères</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Classification</term>
<term>Reconnaissance optique de caractères</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This work focuses on assessing the character recognition results produced automatically by optical character recognition (OCR) software in mass digitization projects. The goal is to design a global control system robust enough to handle the BnF document collection, which includes old documents that are difficult for OCR to process. We designed a detection system that finds words missed in OCR results, and a word recognition rate estimator that assesses the quality of the word recognition performed by OCR. We created two kinds of descriptors to characterize OCR outputs: image descriptors, which characterize page segmentation results, and cross-alignment descriptors, which characterize the quality of word recognition results. Furthermore, we adapted our learning process to build adaptive decision or prediction systems. We evaluated our control systems on real images selected randomly from the BnF collection. The missed-word detection system detects 84.15% of the words omitted by the OCR with a precision of 94.73%. The experiments also showed that 80% of the documents with a word recognition rate below 98% are detected with a precision of 92%. The system can also automatically detect 45% of the documents with a recognition rate below 70% with a precision greater than 92%.</div>
</front>
</TEI>
<hal api="V3">
<titleStmt>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<title xml:lang="fr">Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse</title>
<author role="aut">
<persName>
<forename type="first">Ahmed</forename>
<surname>Ben Salah</surname>
</persName>
<email>ahmed.ben-salah@bnf.fr</email>
<ptr type="url" target="http://www.univ-rouen.fr/34392/0/fiche_annuaire/"></ptr>
<idno type="halauthor">768392</idno>
<affiliation ref="#struct-399419"></affiliation>
</author>
<editor role="depositor">
<persName>
<forename>Ahmed</forename>
<surname>Ben Salah</surname>
</persName>
<email>dr.ahmed.ben.salah@gmail.com</email>
</editor>
<funder>Plan Triennal de recherche</funder>
</titleStmt>
<editionStmt>
<edition n="v1" type="current">
<date type="whenSubmitted">2015-06-22 18:42:00</date>
<date type="whenModified">2016-04-28 15:43:29</date>
<date type="whenReleased">2015-06-23 09:18:39</date>
<date type="whenProduced">2014-07-11</date>
<date type="whenEndEmbargoed">2015-06-22</date>
<ref type="file" target="https://hal-bnf.archives-ouvertes.fr/tel-01164698/document">
<date notBefore="2015-06-22"></date>
</ref>
<ref type="file" n="1" target="https://hal-bnf.archives-ouvertes.fr/tel-01164698/file/These%20Ahmed%20Ben%20Salah%20vf%20mod%20%281%29.pdf">
<date notBefore="2015-06-22"></date>
</ref>
</edition>
<respStmt>
<resp>contributor</resp>
<name key="119033">
<persName>
<forename>Ahmed</forename>
<surname>Ben Salah</surname>
</persName>
<email>dr.ahmed.ben.salah@gmail.com</email>
</name>
</respStmt>
</editionStmt>
<publicationStmt>
<distributor>CCSD</distributor>
<idno type="halId">tel-01164698</idno>
<idno type="halUri">https://hal-bnf.archives-ouvertes.fr/tel-01164698</idno>
<idno type="halBibtex">bensalah:tel-01164698</idno>
<idno type="halRefHtml">Traitement des images. Université de Rouen, 2014. Français</idno>
<idno type="halRef">Traitement des images. Université de Rouen, 2014. Français</idno>
</publicationStmt>
<seriesStmt>
<idno type="stamp" n="BNF">Bibliothèque nationale de France</idno>
<idno type="stamp" n="UNIV-TOURS">Université François Rabelais</idno>
<idno type="stamp" n="UNIV-LEHAVRE">Université du Havre</idno>
<idno type="stamp" n="UNIV-ROUEN">Université de Rouen</idno>
<idno type="stamp" n="LITIS">Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes</idno>
<idno type="stamp" n="COMUE-NORMANDIE">Normandie Université</idno>
<idno type="stamp" n="BNF_DSC" p="BNF">Bibliothèque nationale de France, Département de la Conservation</idno>
</seriesStmt>
<notesStmt></notesStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Adaptative quality control of digital documents in mass digitization projects</title>
<title xml:lang="fr">Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse</title>
<author role="aut">
<persName>
<forename type="first">Ahmed</forename>
<surname>Ben Salah</surname>
</persName>
<email>ahmed.ben-salah@bnf.fr</email>
<ptr type="url" target="http://www.univ-rouen.fr/34392/0/fiche_annuaire/"></ptr>
<idno type="halAuthorId">768392</idno>
<affiliation ref="#struct-399419"></affiliation>
</author>
</analytic>
<monogr>
<imprint>
<date type="dateDefended">2014-07-11</date>
</imprint>
<authority type="institution">Université de Rouen</authority>
<authority type="school">SPMII - Sciences Physiques, Mathématiques et de l'Information pour l'Ingénieur</authority>
<authority type="supervisor">Thierry Paquet</authority>
<authority type="supervisor">Nicolas Ragot</authority>
<authority type="jury">Jean-Marc Ogier</authority>
<authority type="jury">Remy Mullot</authority>
<authority type="jury">Jean-Philippe Domenger</authority>
<authority type="jury">Laurent Duplouy</authority>
<authority type="jury">Thierry Pardé</authority>
</monogr>
<ref type="seeAlso">https://hal.archives-ouvertes.fr/BNF_DSC/hal-00820564v1</ref>
<ref type="seeAlso">http://production-scientifique.bnf.fr/Biblio/prediction-selection-decision-document-using-bibliographic-data-national-library-france-bnf</ref>
<ref type="seeAlso">http://production-scientifique.bnf.fr/Biblio/digital-documents-quality-control-workflow-bnf-operation-issue-improvement</ref>
<ref type="seeAlso">http://production-scientifique.bnf.fr/Biblio/aide-la-gestion-des-processus-de-numerisation-en-vue-de-locrisation-des-ouvrages</ref>
</biblStruct>
</sourceDesc>
<profileDesc>
<langUsage>
<language ident="fr">French</language>
</langUsage>
<textClass>
<keywords scheme="author">
<term xml:lang="en">Segmentation defects</term>
<term xml:lang="en">Texture Characterization</term>
<term xml:lang="en">Quality Assessment</term>
<term xml:lang="en">Optical Character Recognition</term>
<term xml:lang="fr">Erreur de segmentation</term>
<term xml:lang="fr">Analyse de texture</term>
<term xml:lang="fr">Classification</term>
<term xml:lang="fr">Reconnaissance de caractères</term>
<term xml:lang="fr">Prédiction de performances</term>
<term xml:lang="fr">Reconnaissance optique de caractères</term>
</keywords>
<classCode scheme="halDomain" n="info.info-ti">Computer Science [cs]/Image Processing</classCode>
<classCode scheme="halTypology" n="THESE">Theses</classCode>
</textClass>
<abstract xml:lang="en">This work focuses on assessing the character recognition results produced automatically by optical character recognition (OCR) software in mass digitization projects. The goal is to design a global control system robust enough to handle the BnF document collection, which includes old documents that are difficult for OCR to process. We designed a detection system that finds words missed in OCR results, and a word recognition rate estimator that assesses the quality of the word recognition performed by OCR. We created two kinds of descriptors to characterize OCR outputs: image descriptors, which characterize page segmentation results, and cross-alignment descriptors, which characterize the quality of word recognition results. Furthermore, we adapted our learning process to build adaptive decision or prediction systems. We evaluated our control systems on real images selected randomly from the BnF collection. The missed-word detection system detects 84.15% of the words omitted by the OCR with a precision of 94.73%. The experiments also showed that 80% of the documents with a word recognition rate below 98% are detected with a precision of 92%. The system can also automatically detect 45% of the documents with a recognition rate below 70% with a precision greater than 92%.</abstract>
<abstract xml:lang="fr">Ce travail s’intéresse au contrôle des résultats de transcriptions numériques produites automatiquement par des logiciels de reconnaissance optique de caractères (OCR), lors de la réalisation de projets de numérisation de masse de documents. Le but de nos travaux est de concevoir un système de contrôle des résultats d’OCR suffisamment robuste pour être performant sur l’ensemble des documents numérisés à la BnF. Cette collection est composée de documents anciens dont les particularités les rendent difficiles à traiter par les OCR, même les plus performants. Nous avons conçu un système de détection des mots omis dans les transcriptions, ainsi qu’une méthode d’estimation des taux de reconnaissance des caractères. Le contexte applicatif exclut de recourir à une vérité terrain pour évaluer les performances. Nous essayons donc de les prédire. Pour cela nous proposons différents descripteurs qui permettent de caractériser les résultats des transcriptions. Cette caractérisation intervient à deux niveaux. Elle permet d’une part de caractériser la segmentation des documents à l’aide de descripteurs de textures, et d’autre part de caractériser les textes produits en ayant recours à un second OCR qui joue le rôle d’une référence relative. Dans les deux cas, les descripteurs choisis permettent de s’adapter aux propriétés des corpus à contrôler. L’adaptation est également assurée par une étape d’apprentissage des étages de décision ou de prédiction qui interviennent dans le système. Nous avons évalué nos systèmes de contrôle sur des bases d’images réelles sélectionnées dans les collections documentaires de la BnF. Le système détecte 84,15% des mots omis par l’OCR avec une précision de 94,73%. Les expérimentations réalisées ont également permis de montrer que 80% des documents présentant un taux de reconnaissance de mots inférieur à 98% sont détectés avec une précision de 92%. On peut également détecter automatiquement 45% des documents présentant un taux de reconnaissance inférieur à 70% avec une précision supérieure à 92%.</abstract>
<particDesc>
<org type="consortium">Bibliothèque nationale de France</org>
<org type="consortium">Université François Rabelais</org>
</particDesc>
</profileDesc>
</hal>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/France/explor/LeHavreV1/Data/Hal/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000120 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Hal/Checkpoint/biblio.hfd -nk 000120 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/France
   |area=    LeHavreV1
   |flux=    Hal
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Hal:tel-01164698
   |texte=   Adaptative quality control of digital documents in mass digitization projects
}}

This area was generated with Dilib version V0.6.25.
Data generation: Sat Dec 3 14:37:02 2016. Site generation: Tue Mar 5 08:25:07 2024