A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models
Internal identifier: 000819 (Main/Exploration); previous: 000818; next: 000820
Authors: Dharitri Misra; Siyuan Chen; George R. Thoma
Source:
- Archiving ... . IS & T's Archiving Conference ; 2009.
Abstract
One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.
At the U.S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.
In this paper, we describe the design of our AME system, with a focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
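As a rough illustration of what a rule-based string pattern search over a recognized layout zone might look like, the following Python sketch applies one regular expression per field. The field names, the patterns, and the sample text are all hypothetical and invented for this example; they are not the rules or data used in the AME system described in the abstract.

```python
import re

# Hypothetical rules: each field is located by a pattern that covers both the
# label text surrounding the field and the value itself.
FIELD_RULES = {
    "date": re.compile(r"(?:Dated|Date)\s*[:.]?\s*(\w+ \d{1,2}, \d{4})"),
    "case_number": re.compile(r"Case\s+No\.?\s*(\d+)"),
}


def extract_fields(zone_text):
    """Return the first match of each rule found in one recognized text zone."""
    found = {}
    for name, pattern in FIELD_RULES.items():
        match = pattern.search(zone_text)
        if match:
            found[name] = match.group(1)
    return found


# Made-up sample text standing in for the OCR output of one layout zone.
sample = "Case No. 1234. Dated: March 3, 1941, at Washington."
result = extract_fields(sample)
```

In a sketch like this, the layout recognition step would decide which zone of the page each rule is applied to, so that a pattern only has to be reliable within its own zone rather than across the whole document.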
Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3004227
PubMed: 21179386
PubMed Central: 3004227
Affiliations:
Links to previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 000090
- to stream Pmc, to step Curation: 000090
- to stream Pmc, to step Checkpoint: 000165
- to stream Ncbi, to step Merge: 000092
- to stream Ncbi, to step Curation: 000092
- to stream Ncbi, to step Checkpoint: 000092
- to stream Main, to step Merge: 000827
- to stream Main, to step Curation: 000819
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models</title>
<author><name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
</author>
<author><name sortKey="Chen, Siyuan" sort="Chen, Siyuan" uniqKey="Chen S" first="Siyuan" last="Chen">Siyuan Chen</name>
</author>
<author><name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">21179386</idno>
<idno type="pmc">3004227</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3004227</idno>
<idno type="RBID">PMC:3004227</idno>
<date when="2009">2009</date>
<idno type="wicri:Area/Pmc/Corpus">000090</idno>
<idno type="wicri:Area/Pmc/Curation">000090</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000165</idno>
<idno type="wicri:Area/Ncbi/Merge">000092</idno>
<idno type="wicri:Area/Ncbi/Curation">000092</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000092</idno>
<idno type="wicri:Area/Main/Merge">000827</idno>
<idno type="wicri:Area/Main/Curation">000819</idno>
<idno type="wicri:Area/Main/Exploration">000819</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models</title>
<author><name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
</author>
<author><name sortKey="Chen, Siyuan" sort="Chen, Siyuan" uniqKey="Chen S" first="Siyuan" last="Chen">Siyuan Chen</name>
</author>
<author><name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</author>
</analytic>
<series><title level="j">Archiving ... . IS &amp; T's Archiving Conference</title>
<imprint><date when="2009">2009</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p id="P1">One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.</p>
<p id="P2">At the U.S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.</p>
<p id="P3">In this paper, we describe the design of our AME system, with a focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.</p>
</div>
</front>
</TEI>
<affiliations><list></list>
<tree><noCountry><name sortKey="Chen, Siyuan" sort="Chen, Siyuan" uniqKey="Chen S" first="Siyuan" last="Chen">Siyuan Chen</name>
<name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</noCountry>
</tree>
</affiliations>
</record>
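For readers without the Dilib tools, a record of this shape can also be read with a general-purpose XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the record above; the `summarize` function and the keys of the dictionary it returns are names chosen for this illustration and are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed fragment of the record above (illustration only).
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models</title>
<author><name first="Dharitri" last="Misra">Dharitri Misra</name></author>
<author><name first="Siyuan" last="Chen">Siyuan Chen</name></author>
<author><name first="George R." last="Thoma">George R. Thoma</name></author>
</titleStmt>
<publicationStmt><idno type="pmid">21179386</idno>
<idno type="pmc">3004227</idno>
<date when="2009">2009</date>
</publicationStmt></fileDesc></teiHeader></TEI></record>"""


def summarize(record_xml):
    """Collect title, author names, identifiers and year from one record."""
    root = ET.fromstring(record_xml)
    title_stmt = root.find(".//titleStmt")
    # Author display names are the text content of each <name> element.
    authors = [name.text for name in title_stmt.findall("author/name")]
    # Map each <idno> type attribute (pmid, pmc, ...) to its value.
    ids = {idno.get("type"): idno.text for idno in root.iter("idno")}
    return {
        "title": title_stmt.findtext("title"),
        "authors": authors,
        "ids": ids,
        "year": root.find(".//publicationStmt/date").get("when"),
    }


summary = summarize(RECORD)
```

ElementTree only accepts well-formed input, so a bare `&` in a title would have to be written as `&amp;` before the full record could be parsed this way.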
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000819 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000819 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= PMC:3004227 |texte= A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i -Sk "pubmed:21179386" \
| HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd \
| NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.