Text Verification in an Automated System for the Extraction of Bibliographic Data

Internal identifier: 001079 (Istex/Corpus); previous: 001078; next: 001080

Authors: George R. Thoma; Glenn Ford; Daniel Le; Zhirong Li

Source:

Lecture Notes in Computer Science [ 0302-9743 ] ; 2002.

RBID : ISTEX:5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539

Abstract

Abstract: An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.

Url:

https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/fulltext/pdf

DOI: 10.1007/3-540-45869-7_46

Links to Exploration step

ISTEX:5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
<author><name sortKey="Thoma, R" sort="Thoma, R" uniqKey="Thoma R" first="R." last="Thoma">R. Thoma</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Ford, Glenn" sort="Ford, Glenn" uniqKey="Ford G" first="Glenn" last="Ford">Glenn Ford</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Le, Daniel" sort="Le, Daniel" uniqKey="Le D" first="Daniel" last="Le">Daniel Le</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Li, Zhirong" sort="Li, Zhirong" uniqKey="Li Z" first="Zhirong" last="Li">Zhirong Li</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1007/3-540-45869-7_46</idno>
<idno type="url">https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001079</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
<author><name sortKey="Thoma, R" sort="Thoma, R" uniqKey="Thoma R" first="R." last="Thoma">R. Thoma</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Ford, Glenn" sort="Ford, Glenn" uniqKey="Ford G" first="Glenn" last="Ford">Glenn Ford</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Le, Daniel" sort="Le, Daniel" uniqKey="Le D" first="Daniel" last="Le">Daniel Le</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
<author><name sortKey="Li, Zhirong" sort="Li, Zhirong" uniqKey="Li Z" first="Zhirong" last="Li">Zhirong Li</name>
<affiliation><mods:affiliation>National Library of Medicine, Bethesda, Maryland</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2002</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539</idno>
<idno type="DOI">10.1007/3-540-45869-7_46</idno>
<idno type="ChapterID">46</idno>
<idno type="ChapterID">Chap46</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.</div>
</front>
</TEI>
<istex><corpusName>springer</corpusName>
<author><json:item><name>George R. Thoma</name>
<affiliations><json:string>National Library of Medicine, Bethesda, Maryland</json:string>
</affiliations>
</json:item>
<json:item><name>Glenn Ford</name>
<affiliations><json:string>National Library of Medicine, Bethesda, Maryland</json:string>
</affiliations>
</json:item>
<json:item><name>Daniel Le</name>
<affiliations><json:string>National Library of Medicine, Bethesda, Maryland</json:string>
</affiliations>
</json:item>
<json:item><name>Zhirong Li</name>
<affiliations><json:string>National Library of Medicine, Bethesda, Maryland</json:string>
</affiliations>
</json:item>
</author>
<language><json:string>eng</json:string>
</language>
<abstract>Abstract: An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.</abstract>
<qualityIndicators><score>3.403</score>
<pdfVersion>1.3</pdfVersion>
<pdfPageSize>648 x 864 pts</pdfPageSize>
<refBibsNative>false</refBibsNative>
<keywordCount>0</keywordCount>
<abstractCharCount>468</abstractCharCount>
<pdfWordCount>2563</pdfWordCount>
<pdfCharCount>15282</pdfCharCount>
<pdfPageCount>10</pdfPageCount>
<abstractWordCount>70</abstractWordCount>
</qualityIndicators>
<title>Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
<genre.original><json:string>OriginalPaper</json:string>
</genre.original>
<chapterId><json:string>46</json:string>
<json:string>Chap46</json:string>
</chapterId>
<genre><json:string>conference [eBooks]</json:string>
</genre>
<serie><editor><json:item><name>Gerhard Goos</name>
<affiliations><json:string>Karlsruhe University, Germany</json:string>
</affiliations>
</json:item>
<json:item><name>Juris Hartmanis</name>
<affiliations><json:string>Cornell University, NY, USA</json:string>
</affiliations>
</json:item>
<json:item><name>Jan van Leeuwen</name>
<affiliations><json:string>Utrecht University, The Netherlands</json:string>
</affiliations>
</json:item>
</editor>
<issn><json:string>0302-9743</json:string>
</issn>
<language><json:string>unknown</json:string>
</language>
<title>Lecture Notes in Computer Science</title>
<copyrightDate>2002</copyrightDate>
</serie>
<host><editor><json:item><name>Daniel Lopresti</name>
<affiliations><json:string>Bell Labs, Lucent Technologies, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA</json:string>
<json:string>E-mail: dpl@research.bell-labs.com</json:string>
</affiliations>
</json:item>
<json:item><name>Jianying Hu</name>
<affiliations><json:string>Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA</json:string>
<json:string>E-mail: jianhu@research.avayalabs.com</json:string>
</affiliations>
</json:item>
<json:item><name>Ramanujan Kashi</name>
<affiliations><json:string>Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA</json:string>
<json:string>E-mail: ramanuja@research.avayalabs.com</json:string>
</affiliations>
</json:item>
</editor>
<subject><json:item><value>Computer Science</value>
</json:item>
<json:item><value>Computer Science</value>
</json:item>
<json:item><value>Pattern Recognition</value>
</json:item>
<json:item><value>Information Storage and Retrieval</value>
</json:item>
<json:item><value>Document Preparation and Text Processing</value>
</json:item>
<json:item><value>Image Processing and Computer Vision</value>
</json:item>
</subject>
<isbn><json:string>978-3-540-44068-0</json:string>
</isbn>
<language><json:string>unknown</json:string>
</language>
<title>Document Analysis Systems V</title>
<genre.original><json:string>Proceedings</json:string>
</genre.original>
<bookId><json:string>3-540-45869-7</json:string>
</bookId>
<volume>2423</volume>
<pages><last>432</last>
<first>423</first>
</pages>
<issn><json:string>0302-9743</json:string>
</issn>
<genre><json:string>Book Series</json:string>
</genre>
<eisbn><json:string>978-3-540-45869-2</json:string>
</eisbn>
<copyrightDate>2002</copyrightDate>
<doi><json:string>10.1007/3-540-45869-7</json:string>
</doi>
</host>
<publicationDate>2002</publicationDate>
<copyrightDate>2002</copyrightDate>
<doi><json:string>10.1007/3-540-45869-7_46</json:string>
</doi>
<id>5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539</id>
<fulltext><json:item><original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/fulltext/pdf</uri>
</json:item>
<json:item><original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/fulltext/tei"><teiHeader><fileDesc><titleStmt><title level="a" type="main" xml:lang="en">Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
<respStmt xml:id="ISTEX-API" resp="Références bibliographiques récupérées via GROBID" name="ISTEX-API (INIST-CNRS)"></respStmt>
</titleStmt>
<publicationStmt><authority>ISTEX</authority>
<publisher>Springer Berlin Heidelberg</publisher>
<pubPlace>Berlin, Heidelberg</pubPlace>
<availability><p>SPRINGER</p>
</availability>
<date>2002</date>
</publicationStmt>
<sourceDesc><biblStruct type="inbook"><analytic><title level="a" type="main" xml:lang="en">Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
<author><persName><forename type="first">George</forename>
<surname>Thoma</surname>
</persName>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
</author>
<author><persName><forename type="first">Glenn</forename>
<surname>Ford</surname>
</persName>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
</author>
<author><persName><forename type="first">Daniel</forename>
<surname>Le</surname>
</persName>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
</author>
<author><persName><forename type="first">Zhirong</forename>
<surname>Li</surname>
</persName>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
</author>
</analytic>
<monogr><title level="m">Document Analysis Systems V</title>
<title level="m" type="sub">5th International Workshop, DAS 2002 Princeton, NJ, USA, August 19–21, 2002 Proceedings</title>
<idno type="pISBN">978-3-540-44068-0</idno>
<idno type="eISBN">978-3-540-45869-2</idno>
<idno type="pISSN">0302-9743</idno>
<idno type="DOI">10.1007/3-540-45869-7</idno>
<idno type="BookID">3-540-45869-7</idno>
<idno type="BookTitleID">72681</idno>
<idno type="BookSequenceNumber">2423</idno>
<idno type="BookVolumeNumber">2423</idno>
<idno type="BookChapterCount">58</idno>
<editor><persName><forename type="first">Daniel</forename>
<surname>Lopresti</surname>
</persName>
<email>dpl@research.bell-labs.com</email>
<affiliation>Bell Labs, Lucent Technologies, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA</affiliation>
</editor>
<editor><persName><forename type="first">Jianying</forename>
<surname>Hu</surname>
</persName>
<email>jianhu@research.avayalabs.com</email>
<affiliation>Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA</affiliation>
</editor>
<editor><persName><forename type="first">Ramanujan</forename>
<surname>Kashi</surname>
</persName>
<email>ramanuja@research.avayalabs.com</email>
<affiliation>Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA</affiliation>
</editor>
<imprint><publisher>Springer Berlin Heidelberg</publisher>
<pubPlace>Berlin, Heidelberg</pubPlace>
<date type="published" when="2002"></date>
<biblScope unit="volume">2423</biblScope>
<biblScope unit="page" from="423">423</biblScope>
<biblScope unit="page" to="432">432</biblScope>
</imprint>
</monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<editor><persName><forename type="first">Gerhard</forename>
<surname>Goos</surname>
</persName>
<affiliation>Karlsruhe University, Germany</affiliation>
</editor>
<editor><persName><forename type="first">Juris</forename>
<surname>Hartmanis</surname>
</persName>
<affiliation>Cornell University, NY, USA</affiliation>
</editor>
<editor><persName><forename type="first">Jan</forename>
<surname>van Leeuwen</surname>
</persName>
<affiliation>Utrecht University, The Netherlands</affiliation>
</editor>
<biblScope><date>2002</date>
</biblScope>
<idno type="pISSN">0302-9743</idno>
<idno type="seriesId">558</idno>
</series>
<idno type="istex">5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539</idno>
<idno type="DOI">10.1007/3-540-45869-7_46</idno>
<idno type="ChapterID">46</idno>
<idno type="ChapterID">Chap46</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><creation><date>2002</date>
</creation>
<langUsage><language ident="en">en</language>
</langUsage>
<abstract xml:lang="en"><p>Abstract: An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.</p>
</abstract>
<textClass><keywords scheme="Book Subject Collection"><list><label>SUCO11645</label>
<item><term>Computer Science</term>
</item>
</list>
</keywords>
</textClass>
<textClass><keywords scheme="Book Subject Group"><list><label>I</label>
<label>I2203X</label>
<label>I18032</label>
<label>I21033</label>
<label>I22021</label>
<item><term>Computer Science</term>
</item>
<item><term>Pattern Recognition</term>
</item>
<item><term>Information Storage and Retrieval</term>
</item>
<item><term>Document Preparation and Text Processing</term>
</item>
<item><term>Image Processing and Computer Vision</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc><change when="2002">Published</change>
<change xml:id="refBibs-istex" who="#ISTEX-API" when="2016-3-19">References added</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item><original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata><istex:metadataXml wicri:clean="Springer, Publisher found" wicri:toSee="no header"><istex:xmlDeclaration>version="1.0" encoding="UTF-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//Springer-Verlag//DTD A++ V2.4//EN" URI="http://devel.springer.de/A++/V2.4/DTD/A++V2.4.dtd" name="istex:docType"></istex:docType>
<istex:document><Publisher><PublisherInfo><PublisherName>Springer Berlin Heidelberg</PublisherName>
<PublisherLocation>Berlin, Heidelberg</PublisherLocation>
</PublisherInfo>
<Series><SeriesInfo TocLevels="0" SeriesType="Series"><SeriesID>558</SeriesID>
<SeriesPrintISSN>0302-9743</SeriesPrintISSN>
<SeriesTitle Language="En">Lecture Notes in Computer Science</SeriesTitle>
</SeriesInfo>
<SeriesHeader><EditorGroup><Editor AffiliationIDS="Aff1"><EditorName DisplayOrder="Western"><GivenName>Gerhard</GivenName>
<FamilyName>Goos</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff2"><EditorName DisplayOrder="Western"><GivenName>Juris</GivenName>
<FamilyName>Hartmanis</FamilyName>
</EditorName>
</Editor>
<Editor AffiliationIDS="Aff3"><EditorName DisplayOrder="Western"><GivenName>Jan</GivenName>
<Particle>van</Particle>
<FamilyName>Leeuwen</FamilyName>
</EditorName>
</Editor>
<Affiliation ID="Aff1"><OrgName>Karlsruhe University</OrgName>
<OrgAddress><Country>Germany</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff2"><OrgName>Cornell University</OrgName>
<OrgAddress><State>NY</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff3"><OrgName>Utrecht University</OrgName>
<OrgAddress><Country>The Netherlands</Country>
</OrgAddress>
</Affiliation>
</EditorGroup>
</SeriesHeader>
<Book Language="En"><BookInfo Language="En" TocLevels="0" NumberingStyle="Unnumbered" ContainsESM="No" BookProductType="Proceedings" MediaType="eBook"><BookID>3-540-45869-7</BookID>
<BookTitle>Document Analysis Systems V</BookTitle>
<BookSubTitle>5th International Workshop, DAS 2002 Princeton, NJ, USA, August 19–21, 2002 Proceedings</BookSubTitle>
<BookVolumeNumber>2423</BookVolumeNumber>
<BookSequenceNumber>2423</BookSequenceNumber>
<BookDOI>10.1007/3-540-45869-7</BookDOI>
<BookTitleID>72681</BookTitleID>
<BookPrintISBN>978-3-540-44068-0</BookPrintISBN>
<BookElectronicISBN>978-3-540-45869-2</BookElectronicISBN>
<BookChapterCount>58</BookChapterCount>
<BookHistory><OnlineDate><Year>2002</Year>
<Month>8</Month>
<Day>9</Day>
</OnlineDate>
</BookHistory>
<BookCopyright><CopyrightHolderName>Springer-Verlag Berlin Heidelberg</CopyrightHolderName>
<CopyrightYear>2002</CopyrightYear>
</BookCopyright>
<BookSubjectGroup><BookSubject Code="I" Type="Primary">Computer Science</BookSubject>
<BookSubject Code="I2203X" Priority="1" Type="Secondary">Pattern Recognition</BookSubject>
<BookSubject Code="I18032" Priority="2" Type="Secondary">Information Storage and Retrieval</BookSubject>
<BookSubject Code="I21033" Priority="3" Type="Secondary">Document Preparation and Text Processing</BookSubject>
<BookSubject Code="I22021" Priority="4" Type="Secondary">Image Processing and Computer Vision</BookSubject>
<SubjectCollection Code="SUCO11645">Computer Science</SubjectCollection>
</BookSubjectGroup>
<BookContext><SeriesID>558</SeriesID>
</BookContext>
</BookInfo>
<BookHeader><EditorGroup><Editor AffiliationIDS="Aff4"><EditorName DisplayOrder="Western"><GivenName>Daniel</GivenName>
<FamilyName>Lopresti</FamilyName>
</EditorName>
<Contact><Email>dpl@research.bell-labs.com</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff5"><EditorName DisplayOrder="Western"><GivenName>Jianying</GivenName>
<FamilyName>Hu</FamilyName>
</EditorName>
<Contact><Email>jianhu@research.avayalabs.com</Email>
</Contact>
</Editor>
<Editor AffiliationIDS="Aff5"><EditorName DisplayOrder="Western"><GivenName>Ramanujan</GivenName>
<FamilyName>Kashi</FamilyName>
</EditorName>
<Contact><Email>ramanuja@research.avayalabs.com</Email>
</Contact>
</Editor>
<Affiliation ID="Aff4"><OrgName>Bell Labs, Lucent Technologies</OrgName>
<OrgAddress><Street>600 Mountain Avenue</Street>
<Postcode>07974</Postcode>
<City>Murray Hill</City>
<State>NJ</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
<Affiliation ID="Aff5"><OrgName>Avaya Labs Research</OrgName>
<OrgAddress><Street>233 Mount Airy Road</Street>
<Postcode>07920</Postcode>
<City>Basking Ridge</City>
<State>NJ</State>
<Country>USA</Country>
</OrgAddress>
</Affiliation>
</EditorGroup>
</BookHeader>
<Part ID="Part7"><PartInfo TocLevels="0"><PartID>7</PartID>
<PartSequenceNumber>7</PartSequenceNumber>
<PartTitle>Indexing and Retrieval</PartTitle>
<PartChapterCount>7</PartChapterCount>
<PartContext><SeriesID>558</SeriesID>
<BookID>3-540-45869-7</BookID>
<BookTitle>Document Analysis Systems V</BookTitle>
</PartContext>
</PartInfo>
<Chapter ID="Chap46" Language="En"><ChapterInfo ChapterType="OriginalPaper" NumberingStyle="Unnumbered" Language="En" TocLevels="0" ContainsESM="No"><ChapterID>46</ChapterID>
<ChapterDOI>10.1007/3-540-45869-7_46</ChapterDOI>
<ChapterSequenceNumber>46</ChapterSequenceNumber>
<ChapterTitle Language="En">Text Verification in an Automated System for the Extraction of Bibliographic Data</ChapterTitle>
<ChapterFirstPage>423</ChapterFirstPage>
<ChapterLastPage>432</ChapterLastPage>
<ChapterCopyright><CopyrightHolderName>Springer-Verlag Berlin Heidelberg</CopyrightHolderName>
<CopyrightYear>2002</CopyrightYear>
</ChapterCopyright>
<ChapterHistory><RegistrationDate><Year>2002</Year>
<Month>8</Month>
<Day>8</Day>
</RegistrationDate>
<OnlineDate><Year>2002</Year>
<Month>8</Month>
<Day>9</Day>
</OnlineDate>
</ChapterHistory>
<ChapterGrants Type="Regular"><MetadataGrant Grant="OpenAccess"></MetadataGrant>
<AbstractGrant Grant="OpenAccess"></AbstractGrant>
<BodyPDFGrant Grant="Restricted"></BodyPDFGrant>
<BodyHTMLGrant Grant="Restricted"></BodyHTMLGrant>
<BibliographyGrant Grant="Restricted"></BibliographyGrant>
<ESMGrant Grant="Restricted"></ESMGrant>
</ChapterGrants>
<ChapterContext><SeriesID>558</SeriesID>
<PartID>7</PartID>
<BookID>3-540-45869-7</BookID>
<BookTitle>Document Analysis Systems V</BookTitle>
</ChapterContext>
</ChapterInfo>
<ChapterHeader><AuthorGroup><Author AffiliationIDS="Aff6"><AuthorName DisplayOrder="Western"><GivenName>George</GivenName>
<GivenName>R.</GivenName>
<FamilyName>Thoma</FamilyName>
</AuthorName>
</Author>
<Author AffiliationIDS="Aff6"><AuthorName DisplayOrder="Western"><GivenName>Glenn</GivenName>
<FamilyName>Ford</FamilyName>
</AuthorName>
</Author>
<Author AffiliationIDS="Aff6"><AuthorName DisplayOrder="Western"><GivenName>Daniel</GivenName>
<FamilyName>Le</FamilyName>
</AuthorName>
</Author>
<Author AffiliationIDS="Aff6"><AuthorName DisplayOrder="Western"><GivenName>Zhirong</GivenName>
<FamilyName>Li</FamilyName>
</AuthorName>
</Author>
<Affiliation ID="Aff6"><OrgName>National Library of Medicine</OrgName>
<OrgAddress><City>Bethesda, Maryland</City>
</OrgAddress>
</Affiliation>
</AuthorGroup>
<Abstract ID="Abs1" Language="En"><Heading>Abstract</Heading>
<Para>An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.</Para>
</Abstract>
</ChapterHeader>
<NoBody></NoBody>
</Chapter>
</Part>
</Book>
</Series>
</Publisher>
</istex:document>
</istex:metadataXml>
<mods version="3.6"><titleInfo lang="en"><title>Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA" lang="en"><title>Text Verification in an Automated System for the Extraction of Bibliographic Data</title>
</titleInfo>
<name type="personal"><namePart type="given">George</namePart>
<namePart type="given">R.</namePart>
<namePart type="family">Thoma</namePart>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Glenn</namePart>
<namePart type="family">Ford</namePart>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Daniel</namePart>
<namePart type="family">Le</namePart>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Zhirong</namePart>
<namePart type="family">Li</namePart>
<affiliation>National Library of Medicine, Bethesda, Maryland</affiliation>
<role><roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="conference [eBooks]" displayLabel="OriginalPaper"></genre>
<originInfo><publisher>Springer Berlin Heidelberg</publisher>
<place><placeTerm type="text">Berlin, Heidelberg</placeTerm>
</place>
<dateIssued encoding="w3cdtf">2002</dateIssued>
<copyrightDate encoding="w3cdtf">2002</copyrightDate>
</originInfo>
<language><languageTerm type="code" authority="rfc3066">en</languageTerm>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
<physicalDescription><internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract lang="en">Abstract: An essential stage in any text extraction system is the manual verification of the printed material converted by OCR. This proves to be the most labor-intensive step in the process. In a system built and deployed at the National Library of Medicine to automatically extract bibliographic data from scanned biomedical journals, alternative means were considered to validate the text. This paper describes two approaches and gives preliminary performance data.</abstract>
<relatedItem type="host"><titleInfo><title>Document Analysis Systems V</title>
<subTitle>5th International Workshop, DAS 2002 Princeton, NJ, USA, August 19–21, 2002 Proceedings</subTitle>
</titleInfo>
<name type="personal"><namePart type="given">Daniel</namePart>
<namePart type="family">Lopresti</namePart>
<affiliation>Bell Labs, Lucent Technologies, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA</affiliation>
<affiliation>E-mail: dpl@research.bell-labs.com</affiliation>
<role><roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Jianying</namePart>
<namePart type="family">Hu</namePart>
<affiliation>Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA</affiliation>
<affiliation>E-mail: jianhu@research.avayalabs.com</affiliation>
<role><roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Ramanujan</namePart>
<namePart type="family">Kashi</namePart>
<affiliation>Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA</affiliation>
<affiliation>E-mail: ramanuja@research.avayalabs.com</affiliation>
<role><roleTerm type="text">editor</roleTerm>
</role>
</name>
<genre type="Book Series" displayLabel="Proceedings"></genre>
<originInfo><copyrightDate encoding="w3cdtf">2002</copyrightDate>
<issuance>monographic</issuance>
</originInfo>
<subject><genre>Book Subject Collection</genre>
<topic authority="SpringerSubjectCodes" authorityURI="SUCO11645">Computer Science</topic>
</subject>
<subject><genre>Book Subject Group</genre>
<topic authority="SpringerSubjectCodes" authorityURI="I">Computer Science</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I2203X">Pattern Recognition</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I18032">Information Storage and Retrieval</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I21033">Document Preparation and Text Processing</topic>
<topic authority="SpringerSubjectCodes" authorityURI="I22021">Image Processing and Computer Vision</topic>
</subject>
<identifier type="DOI">10.1007/3-540-45869-7</identifier>
<identifier type="ISBN">978-3-540-44068-0</identifier>
<identifier type="eISBN">978-3-540-45869-2</identifier>
<identifier type="ISSN">0302-9743</identifier>
<identifier type="BookTitleID">72681</identifier>
<identifier type="BookID">3-540-45869-7</identifier>
<identifier type="BookChapterCount">58</identifier>
<identifier type="BookVolumeNumber">2423</identifier>
<identifier type="BookSequenceNumber">2423</identifier>
<identifier type="PartChapterCount">7</identifier>
<part><date>2002</date>
<detail type="part"><title>Indexing and Retrieval</title>
</detail>
<detail type="volume"><number>2423</number>
<caption>vol.</caption>
</detail>
<extent unit="pages"><start>423</start>
<end>432</end>
</extent>
</part>
<recordInfo><recordOrigin>Springer-Verlag Berlin Heidelberg, 2002</recordOrigin>
</recordInfo>
</relatedItem>
<relatedItem type="series"><titleInfo><title>Lecture Notes in Computer Science</title>
</titleInfo>
<name type="personal"><namePart type="given">Gerhard</namePart>
<namePart type="family">Goos</namePart>
<affiliation>Karlsruhe University, Germany</affiliation>
<role><roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Juris</namePart>
<namePart type="family">Hartmanis</namePart>
<affiliation>Cornell University, NY, USA</affiliation>
<role><roleTerm type="text">editor</roleTerm>
</role>
</name>
<name type="personal"><namePart type="given">Jan</namePart>
<namePart type="family">van Leeuwen</namePart>
<affiliation>Utrecht University, The Netherlands</affiliation>
<role><roleTerm type="text">editor</roleTerm>
</role>
</name>
<originInfo><copyrightDate encoding="w3cdtf">2002</copyrightDate>
<issuance>serial</issuance>
</originInfo>
<identifier type="ISSN">0302-9743</identifier>
<identifier type="SeriesID">558</identifier>
<recordInfo><recordOrigin>Springer-Verlag Berlin Heidelberg, 2002</recordOrigin>
</recordInfo>
</relatedItem>
<identifier type="istex">5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539</identifier>
<identifier type="DOI">10.1007/3-540-45869-7_46</identifier>
<identifier type="ChapterID">46</identifier>
<identifier type="ChapterID">Chap46</identifier>
<accessCondition type="use and reproduction" contentType="copyright">Springer-Verlag Berlin Heidelberg, 2002</accessCondition>
<recordInfo><recordContentSource>SPRINGER</recordContentSource>
<recordOrigin>Springer-Verlag Berlin Heidelberg, 2002</recordOrigin>
</recordInfo>
</mods>
</metadata>
<enrichments><istex:refBibTEI uri="https://api.istex.fr/document/5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539/enrichments/refBib"><teiHeader></teiHeader>
<text><front></front>
<body></body>
<back><listBibl><biblStruct xml:id="b0"><monogr><title level="m" type="main">Automating the production of bibliographic records for MEDLINE. An R&amp;D report of the Communications Engineering Branch, LHNCBC, NLM</title>
<imprint><pubPlace>Bethesda, Maryland</pubPlace>
<date type="published" when="2001-09"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b1"><analytic><title level="a" type="main">Automated zone correction in bitmapped document images</title>
<author><persName><forename type="first">Se</forename>
<surname>Hauser</surname>
</persName>
</author>
<author><persName><forename type="first">Dx</forename>
<surname>Le</surname>
</persName>
</author>
<author><persName><forename type="first">Gr</forename>
<surname>Thoma</surname>
</persName>
</author>
</analytic>
<monogr><title level="m">Proc. SPIE: Document Recognition and Retrieval VII</title>
<meeting>SPIE: Document Recognition and Retrieval VII<address><addrLine>San Jose, CA</addrLine>
</address>
</meeting>
<imprint><date type="published" when="2000-01"></date>
<biblScope unit="page" from="248" to="258"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b2"><analytic><title level="a" type="main">Automated Labeling in Document Images</title>
<author><persName><forename type="first">J</forename>
<surname>Kim</surname>
</persName>
</author>
<author><persName><forename type="first">Dx</forename>
<surname>Le</surname>
</persName>
</author>
<author><persName><forename type="first">Gr</forename>
<surname>Thoma</surname>
</persName>
</author>
</analytic>
<monogr><title level="m">Proc. SPIE: Document Recognition and Retrieval VIII</title>
<meeting>SPIE: Document Recognition and Retrieval VIII<address><addrLine>San Jose, CA</addrLine>
</address>
</meeting>
<imprint><date type="published" when="2001-01"></date>
<biblScope unit="page" from="111" to="122"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b3"><analytic><title level="a" type="main">Automatic reformatting of OCR text from biomedical journal articles</title>
<author><persName><forename type="first">Gm</forename>
<surname>Ford</surname>
</persName>
</author>
<author><persName><forename type="first">Se</forename>
<surname>Hauser</surname>
</persName>
</author>
<author><persName><forename type="first">Gr</forename>
<surname>Thoma</surname>
</persName>
</author>
</analytic>
<monogr><title level="m">Proc. 1999 Symposium on Document Image Understanding Technology</title>
<meeting>1999 Symposium on Document Image Understanding Technology<address><addrLine>College Park, MD</addrLine>
</address>
</meeting>
<imprint><biblScope unit="page" from="321" to="325"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b4"><analytic><title level="a" type="main">Pattern matching techniques for correcting low confidence OCR words in a known context</title>
<author><persName><forename type="first">G</forename>
<surname>Ford</surname>
</persName>
</author>
<author><persName><forename type="first">Se</forename>
<surname>Hauser</surname>
</persName>
</author>
<author><persName><forename type="first">Dx</forename>
<surname>Le</surname>
</persName>
</author>
<author><persName><forename type="first">Gr</forename>
<surname>Thoma</surname>
</persName>
</author>
</analytic>
<monogr><title level="m">Proc. SPIE: Document Recognition and Retrieval VIII</title>
<meeting>SPIE: Document Recognition and Retrieval VIII</meeting>
<imprint><date type="published" when="2001-01"></date>
<biblScope unit="page" from="241" to="249"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b5"><analytic><title level="a" type="main">Approximate string matching algorithms for limited-vocabulary OCR output correction</title>
<author><persName><forename type="first">Ta</forename>
<surname>Lasko</surname>
</persName>
</author>
<author><persName><forename type="first">Se</forename>
<surname>Hauser</surname>
</persName>
</author>
</analytic>
<monogr><title level="m">Proc. SPIE: Document Recognition and Retrieval VIII</title>
<meeting>SPIE: Document Recognition and Retrieval VIII</meeting>
<imprint><date type="published" when="2001-01"></date>
<biblScope unit="page" from="232" to="240"></biblScope>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b6"><analytic><title level="a" type="main">Character verification. Internal technical report</title>
<author><persName><forename type="first">Z</forename>
<surname>Li</surname>
</persName>
</author>
</analytic>
<monogr><title level="j">Communications Engineering Branch</title>
<imprint><date type="published" when="2001-08-23"></date>
</imprint>
</monogr>
</biblStruct>
<biblStruct xml:id="b7"><monogr><title level="m" type="main">The tricks to make OCR work better. Imaging Magazine</title>
<author><persName><forename type="first">A</forename>
<surname>Moore</surname>
</persName>
</author>
<imprint><date type="published" when="1994-06"></date>
</imprint>
</monogr>
</biblStruct>
</listBibl>
</back>
</text>
</istex:refBibTEI>
</enrichments>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001079 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 001079 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:5F8AE010ABA8590F21D6D07FAB40D8BF11EB0539
   |texte=   Text Verification in an Automated System for the Extraction of Bibliographic Data
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
