Exploration server on OCR

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models

Internal identifier: 000819 (Main/Exploration); previous: 000818; next: 000820

Authors: Dharitri Misra; Siyuan Chen; George R. Thoma

Source:

RBID: PMC:3004227

Abstract

One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.

At the U.S. National Library of Medicine (NLM), we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model techniques is used to create the layout recognition models from a training set of the corpus, after which a rule-based metadata search model extracts the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.

In this paper, we describe the design of our AME system, with a focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
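
The rule-based string pattern search mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the NLM AME implementation: the field names, the regular expressions, and the sample text are all hypothetical, chosen only to show how a value might be located through the text that surrounds it (a label in front of a number, a month name in front of a date).

```python
import re

# Hypothetical rules: each metadata field maps to a regular expression whose
# first capture group holds the field value. These patterns are assumptions
# made for illustration, not the rules used by the AME system.
FIELD_RULES = {
    "case_number": re.compile(r"Case\s+No\.\s*(\d+)"),
    "date": re.compile(r"\b(\w+ \d{1,2}, \d{4})\b"),
}


def extract_metadata(text, rules=FIELD_RULES):
    """Return the first match of each field rule found in the OCR text."""
    found = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        if match:
            found[field] = match.group(1)
    return found


# Hypothetical OCR output for one recognized layout zone.
sample = "Case No. 1234. Judgment entered March 5, 1940, in the District Court."
print(extract_metadata(sample))
```

In a fuller pipeline, a rule set like this would presumably be applied only to the zones that the layout recognition step has already labeled, so each pattern can stay narrow.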


Url:
PubMed: 21179386
PubMed Central: 3004227


Affiliations:




The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models</title>
<author>
<name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
</author>
<author>
<name sortKey="Chen, Siyuan" sort="Chen, Siyuan" uniqKey="Chen S" first="Siyuan" last="Chen">Siyuan Chen</name>
</author>
<author>
<name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21179386</idno>
<idno type="pmc">3004227</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3004227</idno>
<idno type="RBID">PMC:3004227</idno>
<date when="2009">2009</date>
<idno type="wicri:Area/Pmc/Corpus">000090</idno>
<idno type="wicri:Area/Pmc/Curation">000090</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000165</idno>
<idno type="wicri:Area/Ncbi/Merge">000092</idno>
<idno type="wicri:Area/Ncbi/Curation">000092</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000092</idno>
<idno type="wicri:Area/Main/Merge">000827</idno>
<idno type="wicri:Area/Main/Curation">000819</idno>
<idno type="wicri:Area/Main/Exploration">000819</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models</title>
<author>
<name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
</author>
<author>
<name sortKey="Chen, Siyuan" sort="Chen, Siyuan" uniqKey="Chen S" first="Siyuan" last="Chen">Siyuan Chen</name>
</author>
<author>
<name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</author>
</analytic>
<series>
<title level="j">Archiving ... . IS &amp; T's Archiving Conference</title>
<imprint>
<date when="2009">2009</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p id="P1">One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.</p>
<p id="P2">At the U.S. National Library of Medicine (NLM), we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model techniques is used to create the layout recognition models from a training set of the corpus, after which a rule-based metadata search model extracts the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.</p>
<p id="P3">In this paper, we describe the design of our AME system, with a focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.</p>
</div>
</front>
</TEI>
<affiliations>
<list></list>
<tree>
<noCountry>
<name sortKey="Chen, Siyuan" sort="Chen, Siyuan" uniqKey="Chen S" first="Siyuan" last="Chen">Siyuan Chen</name>
<name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</noCountry>
</tree>
</affiliations>
</record>
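
As an alternative to the Dilib tools, a record of this shape can be read with the Python standard library. The sketch below is only an illustration: the `summarize` function and the abridged `RECORD` string are assumptions made for the example (the string keeps a few elements from the record above), and it relies on the record being well-formed XML.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record above, kept short for illustration.
RECORD = """<record><TEI><teiHeader><fileDesc>
<titleStmt>
<title xml:lang="en">A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models</title>
<author><name first="Dharitri" last="Misra">Dharitri Misra</name></author>
<author><name first="Siyuan" last="Chen">Siyuan Chen</name></author>
<author><name first="George R." last="Thoma">George R. Thoma</name></author>
</titleStmt>
<publicationStmt>
<idno type="pmid">21179386</idno>
<idno type="pmc">3004227</idno>
</publicationStmt>
</fileDesc></teiHeader></TEI></record>"""


def summarize(record_xml):
    """Collect the title, author names, and typed identifiers of a record."""
    root = ET.fromstring(record_xml)
    stmt = root.find(".//titleStmt")
    title = stmt.findtext("title")
    authors = [name.text for name in stmt.findall("author/name")]
    # Map each <idno type="..."> to its text content.
    ids = {idno.get("type"): idno.text for idno in root.iter("idno")}
    return {"title": title, "authors": authors, "ids": ids}


summary = summarize(RECORD)
print(summary["authors"])
print(summary["ids"])
```

Restricting the author lookup to `titleStmt` avoids counting the same names again from the `analytic` and `affiliations` parts of the full record.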

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000819 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000819 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:3004227
   |texte=   A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:21179386" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024