OCR exploration server


Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library

Internal identifier: 000102 (Ncbi/Merge); previous: 000101; next: 000103

Authors: Roderic DM Page [United Kingdom]

RBID : PMC:3129327

Abstract

Background

The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive.

Description

A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site http://biostor.org/openurl/. This resolver can be used on the web, or called by bibliographic tools that support OpenURL.

Conclusions

BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from http://biostor.org/.


DOI: 10.1186/1471-2105-12-187
PubMed: 21605356
PubMed Central: 3129327

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library</title>
<author>
<name sortKey="Page, Roderic Dm" sort="Page, Roderic Dm" uniqKey="Page R" first="Roderic Dm" last="Page">Roderic Dm Page</name>
<affiliation wicri:level="4">
<nlm:aff id="I1">Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ</wicri:regionArea>
<orgName type="university">Université de Glasgow</orgName>
<placeName>
<settlement type="city">Glasgow</settlement>
<region type="country">Écosse</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21605356</idno>
<idno type="pmc">3129327</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3129327</idno>
<idno type="RBID">PMC:3129327</idno>
<idno type="doi">10.1186/1471-2105-12-187</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000149</idno>
<idno type="wicri:Area/Pmc/Curation">000149</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000130</idno>
<idno type="wicri:Area/Ncbi/Merge">000102</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library</title>
<author>
<name sortKey="Page, Roderic Dm" sort="Page, Roderic Dm" uniqKey="Page R" first="Roderic Dm" last="Page">Roderic Dm Page</name>
<affiliation wicri:level="4">
<nlm:aff id="I1">Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ</wicri:regionArea>
<orgName type="university">Université de Glasgow</orgName>
<placeName>
<settlement type="city">Glasgow</settlement>
<region type="country">Écosse</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive.</p>
</sec>
<sec>
<title>Description</title>
<p>A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/openurl/">http://biostor.org/openurl/</ext-link>
. This resolver can be used on the web, or called by bibliographic tools that support OpenURL.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/">http://biostor.org/</ext-link>
.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Lambert, O" uniqKey="Lambert O">O Lambert</name>
</author>
<author>
<name sortKey="Bianucci, G" uniqKey="Bianucci G">G Bianucci</name>
</author>
<author>
<name sortKey="Post, K" uniqKey="Post K">K Post</name>
</author>
<author>
<name sortKey="De Muizon, C" uniqKey="De Muizon C">C de Muizon</name>
</author>
<author>
<name sortKey="Salas Gismondi, R" uniqKey="Salas Gismondi R">R Salas-Gismondi</name>
</author>
<author>
<name sortKey="Urbina, M" uniqKey="Urbina M">M Urbina</name>
</author>
<author>
<name sortKey="Reumer, J" uniqKey="Reumer J">J Reumer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melville, H" uniqKey="Melville H">H Melville</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koch, Ac" uniqKey="Koch A">AC Koch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lambert, O" uniqKey="Lambert O">O Lambert</name>
</author>
<author>
<name sortKey="Bianucci, G" uniqKey="Bianucci G">G Bianucci</name>
</author>
<author>
<name sortKey="Post, K" uniqKey="Post K">K Post</name>
</author>
<author>
<name sortKey="De Muizon, C" uniqKey="De Muizon C">C de Muizon</name>
</author>
<author>
<name sortKey="Salas Gismondi, R" uniqKey="Salas Gismondi R">R Salas-Gismondi</name>
</author>
<author>
<name sortKey="Urbina, M" uniqKey="Urbina M">M Urbina</name>
</author>
<author>
<name sortKey="Reumer, J" uniqKey="Reumer J">J Reumer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pilsk, S" uniqKey="Pilsk S">S Pilsk</name>
</author>
<author>
<name sortKey="Person, M" uniqKey="Person M">M Person</name>
</author>
<author>
<name sortKey="Deveer, J" uniqKey="Deveer J">J Deveer</name>
</author>
<author>
<name sortKey="Furfey, J" uniqKey="Furfey J">J Furfey</name>
</author>
<author>
<name sortKey="Kalfatovic, M" uniqKey="Kalfatovic M">M Kalfatovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cameron, Rd" uniqKey="Cameron R">RD Cameron</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Evenhuis, Nl" uniqKey="Evenhuis N">NL Evenhuis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alexander, Cp" uniqKey="Alexander C">CP Alexander</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Michaelsen, W" uniqKey="Michaelsen W">W Michaelsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lynch, Jd" uniqKey="Lynch J">JD Lynch</name>
</author>
<author>
<name sortKey="Ruiz Carranza, Pm" uniqKey="Ruiz Carranza P">PM Ruíz-Carranza</name>
</author>
<author>
<name sortKey="Ardila Robayo, Mc" uniqKey="Ardila Robayo M">MC Ardila-Robayo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, Q" uniqKey="Wei Q">Q Wei</name>
</author>
<author>
<name sortKey="Heidorn, Pb" uniqKey="Heidorn P">PB Heidorn</name>
</author>
<author>
<name sortKey="Freeland, C" uniqKey="Freeland C">C Freeland</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Holthuis, Lb" uniqKey="Holthuis L">LB Holthuis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schevill, We" uniqKey="Schevill W">WE Schevill</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schevill, We" uniqKey="Schevill W">WE Schevill</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Page, Rdm" uniqKey="Page R">RDM Page</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Sompel, Hv" uniqKey="De Sompel H">HV de Sompel</name>
</author>
<author>
<name sortKey="Beit Arie, O" uniqKey="Beit Arie O">O Beit-Arie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Page, Rdm" uniqKey="Page R">RDM Page</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, Tf" uniqKey="Smith T">TF Smith</name>
</author>
<author>
<name sortKey="Waterman, Ms" uniqKey="Waterman M">MS Waterman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Holt, Ewl" uniqKey="Holt E">EWL Holt</name>
</author>
<author>
<name sortKey="Tattersall, Wm" uniqKey="Tattersall W">WM Tattersall</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Von Ahn, L" uniqKey="Von Ahn L">L von Ahn</name>
</author>
<author>
<name sortKey="Maurer, B" uniqKey="Maurer B">B Maurer</name>
</author>
<author>
<name sortKey="Mcmillen, C" uniqKey="Mcmillen C">C McMillen</name>
</author>
<author>
<name sortKey="Abraham, D" uniqKey="Abraham D">D Abraham</name>
</author>
<author>
<name sortKey="Blum, M" uniqKey="Blum M">M Blum</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Nieukerken, Ej" uniqKey="Van Nieukerken E">EJ van Nieukerken</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Raselimanana, Ap" uniqKey="Raselimanana A">AP Raselimanana</name>
</author>
<author>
<name sortKey="Raxworthy, Cj" uniqKey="Raxworthy C">CJ Raxworthy</name>
</author>
<author>
<name sortKey="Nussbaum, Ra" uniqKey="Nussbaum R">RA Nussbaum</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Henning, V" uniqKey="Henning V">V Henning</name>
</author>
<author>
<name sortKey="Reichelt, J" uniqKey="Reichelt J">J Reichelt</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Feitelson, Dg" uniqKey="Feitelson D">DG Feitelson</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lu, X" uniqKey="Lu X">X Lu</name>
</author>
<author>
<name sortKey="Kahle, B" uniqKey="Kahle B">B Kahle</name>
</author>
<author>
<name sortKey="Wang, Jz" uniqKey="Wang J">JZ Wang</name>
</author>
<author>
<name sortKey="Giles, Cl" uniqKey="Giles C">CL Giles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lawrence, S" uniqKey="Lawrence S">S Lawrence</name>
</author>
<author>
<name sortKey="Giles, Cl" uniqKey="Giles C">CL Giles</name>
</author>
<author>
<name sortKey="Bollacker, K" uniqKey="Bollacker K">K Bollacker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Councill, Ig" uniqKey="Councill I">IG Councill</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Zhuang, Z" uniqKey="Zhuang Z">Z Zhuang</name>
</author>
<author>
<name sortKey="Debnath, S" uniqKey="Debnath S">S Debnath</name>
</author>
<author>
<name sortKey="Bolelli, L" uniqKey="Bolelli L">L Bolelli</name>
</author>
<author>
<name sortKey="Lee, Wc" uniqKey="Lee W">WC Lee</name>
</author>
<author>
<name sortKey="Sivasubramaniam, A" uniqKey="Sivasubramaniam A">A Sivasubramaniam</name>
</author>
<author>
<name sortKey="Giles, Cl" uniqKey="Giles C">CL Giles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pocock, Ri" uniqKey="Pocock R">RI Pocock</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21605356</article-id>
<article-id pub-id-type="pmc">3129327</article-id>
<article-id pub-id-type="publisher-id">1471-2105-12-187</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-12-187</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Database</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Page</surname>
<given-names>Roderic DM</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>Roderic.Page@glasgow.ac.uk</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ, UK</aff>
<pub-date pub-type="collection">
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>23</day>
<month>5</month>
<year>2011</year>
</pub-date>
<volume>12</volume>
<fpage>187</fpage>
<lpage>187</lpage>
<history>
<date date-type="received">
<day>21</day>
<month>9</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>5</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2011 Page; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2011</copyright-year>
<copyright-holder>Page; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/12/187"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>The Biodiversity Heritage Library (BHL) is a large digital archive of legacy biological literature, comprising over 31 million pages scanned from books, monographs, and journals. During the digitisation process basic metadata about the scanned items is recorded, but not article-level metadata. Given that the article is the standard unit of citation, this makes it difficult to locate cited literature in BHL. Adding the ability to easily find articles in BHL would greatly enhance the value of the archive.</p>
</sec>
<sec>
<title>Description</title>
<p>A service was developed to locate articles in BHL based on matching article metadata to BHL metadata using approximate string matching, regular expressions, and string alignment. This article locating service is exposed as a standard OpenURL resolver on the BioStor web site
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/openurl/">http://biostor.org/openurl/</ext-link>
. This resolver can be used on the web, or called by bibliographic tools that support OpenURL.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>BioStor provides tools for extracting, annotating, and visualising articles from the Biodiversity Heritage Library. BioStor is available from
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/">http://biostor.org/</ext-link>
.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>In July 2010 Lambert et al. [
<xref ref-type="bibr" rid="B1">1</xref>
] published a paper in
<italic>Nature </italic>
that described an extinct sperm whale possessing the biggest bite of any tetrapod known. They named this formidable predator
<italic>Leviathan melvillei</italic>
, the genus name
<italic>Leviathan </italic>
being derived from the Hebrew 'Livyatan', the species name honouring Herman Melville (author of Moby Dick [
<xref ref-type="bibr" rid="B2">2</xref>
]). As appropriate as this name was, it quickly ran foul of the rules of zoological nomenclature [
<xref ref-type="bibr" rid="B3">3</xref>
] because
<italic>Leviathan </italic>
had been used 169 years previously for an extinct species of mammoth [
<xref ref-type="bibr" rid="B4">4</xref>
]. Although the name
<italic>Leviathan </italic>
Koch [
<xref ref-type="bibr" rid="B4">4</xref>
] had lapsed into obscurity (as a synonym of
<italic>Mammut </italic>
Blumenbach), its existence meant the newly discovered whale had to be renamed, which it duly was a month after the original publication [
<xref ref-type="bibr" rid="B5">5</xref>
].</p>
<p>The fate of Lambert et al.'s
<italic>Leviathan </italic>
illustrates a significant challenge facing researchers finding and naming new species: the discoverability of existing names. In the absence of a global register of all taxonomic names that have ever been published, a researcher about to publish a new name may struggle to establish that it has not already been used. Zoological nomenclature dates from 1758 and botanical nomenclature from 1753, so a comprehensive list of taxonomic names must survey some 250 years of literature [
<xref ref-type="bibr" rid="B6">6</xref>
], much of which is obscure and may not exist in digital form. Digitising this legacy literature is the goal of the Biodiversity Heritage Library (BHL) [
<xref ref-type="bibr" rid="B7">7</xref>
,
<xref ref-type="bibr" rid="B8">8</xref>
], a consortium of natural history museum libraries, botanic libraries, and research institutions. The bulk of this digitisation is carried out by the Internet Archive [
<xref ref-type="bibr" rid="B9">9</xref>
], which scans books (broadly defined to include bound issues of journals), creating a set of electronic files for each scanned item, which includes images of individual pages, and text extracted from those pages using Optical Character Recognition (OCR). BHL takes these files (together with the output from the scanning projects of individual BHL members), indexes them by bibliographic metadata and taxonomic names, and makes the content available on its web site [
<xref ref-type="bibr" rid="B7">7</xref>
] (both as web pages and web services). Although the bulk of BHL's scanning activity focuses on pre-1923 content that is out of copyright, it also holds a considerable amount of post-1923 content contributed by its member institutions, notably publications by various natural history museums.</p>
<p>The inability to easily locate articles in BHL is a substantial obstacle to integrating this legacy biodiversity literature into mainstream scientific publishing. The goal of BioStor is to provide tools to locate and extract articles from the BHL archive. BioStor differs from search engines such as PubMed [
<xref ref-type="bibr" rid="B10">10</xref>
] and Google Scholar [
<xref ref-type="bibr" rid="B11">11</xref>
], which support free-form queries such as "what articles have been published on this topic?", or "what papers has this author published?" BioStor addresses a different question, namely "does this article exist in the BHL archive?" It is a tool to find out whether a specific article exists in the archive, as opposed to finding what articles exist on a particular topic.</p>
<sec>
<title>Locating articles in BHL</title>
<p>The BHL archive comprises "items" corresponding to physical objects which are scanned. Items are grouped together into "titles". A single volume book corresponds to a single title and item, whereas a multi-volume work, such as a journal, will comprise several items grouped under the same title (Figure
<xref ref-type="fig" rid="F1">1</xref>
). Noticeably absent from the BHL model is the standard unit of scientific citation, the article.</p>
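<p>The title, item, and page hierarchy described above can be sketched as a small data structure. This is only an illustrative model, not the BHL schema: the class names, the sample journal, and all identifier values are hypothetical, and only the field names loosely echo the BHL fields mentioned later in the text.</p>
<preformat>
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    page_id: int                  # hypothetical identifier for one scanned image
    number: Optional[str] = None  # printed page number, if one was recorded


@dataclass
class Item:
    item_id: int
    volume_info: str = ""         # free-form description of the scanned volume
    pages: List[Page] = field(default_factory=list)


@dataclass
class Title:
    title_id: int
    short_title: str
    items: List[Item] = field(default_factory=list)


# A journal is one title holding several items. There is deliberately no
# Article class: that is the gap the text describes.
journal = Title(1, "Example Journal", [
    Item(10, "v.1", [Page(100, "1"), Page(101, "2")]),
    Item(11, "v.2", [Page(200, None), Page(201, "2")]),
])
page_count = sum(len(item.pages) for item in journal.items)
```
</preformat>
<p>In this sketch an article would have to be expressed as a list of pages within an item, which is the mapping the rest of the paper sets out to recover.</p>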
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Simplified model of Biodiversity Heritage Library content</bold>
. Each scanned item comprises one or more page images. Items are grouped together into titles.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-1"></graphic>
</fig>
<p>For most modern articles the triple of journal name, volume, and starting page is sufficient to uniquely identify an article [
<xref ref-type="bibr" rid="B12">12</xref>
], and tools such as CrossRef's OpenURL resolver [
<xref ref-type="bibr" rid="B13">13</xref>
] can take this triple and discover whether a Digital Object Identifier (DOI) [
<xref ref-type="bibr" rid="B14">14</xref>
] exists for that article. Publishers make use of this tool to map the literature cited in a manuscript to the corresponding DOI. In an ideal world the BHL model of (title, item, page) (Figure
<xref ref-type="fig" rid="F1">1</xref>
) would map exactly to (journal, volume, page), such that an individual journal would correspond to a title in BHL, and each volume of that journal was a separate item. Given that BHL stores page numbers for each scanned page [
<xref ref-type="bibr" rid="B8">8</xref>
], locating articles would then be trivial and linking to BHL content could be readily integrated into existing publication processes, as well as bibliographic management tools that make use of CrossRef's services to augment user-provided metadata (e.g., Mendeley [
<xref ref-type="bibr" rid="B15">15</xref>
]).</p>
<p>Unfortunately, the actual mapping between articles and BHL content is often rather more complicated. Large articles (e.g., monographs) may be treated as separate "titles" (effectively as if they were books), rather than parts of the same title. A contributing library may have bound several volumes of a journal together, such that a single "item" may comprise multiple volumes. Volume numbers themselves may not be unique within a journal.
<italic>The Annals and Magazine of Natural History </italic>
(ISSN 0374-5481), published from 1828 until 1967 (being succeeded by the
<italic>Journal of Natural History</italic>
, ISSN 0022-2933), is divided into 13 "series", each series numbering its volumes from one onwards. Hence, "volume 1" of
<italic>Annals and Magazine of Natural History </italic>
may refer to any one of 13 volumes spanning 138 years [
<xref ref-type="bibr" rid="B16">16</xref>
]. Journals also differ in whether pagination is unique within a volume, or within parts of a volume. For example, in the journal
<italic>Arkiv för Zoologi </italic>
(ISSN 0004-2110) each article starts on page 1, so that the triple (
<italic>Arkiv för Zoologi</italic>
, 13, 1) may refer to [
<xref ref-type="bibr" rid="B17">17</xref>
,
<xref ref-type="bibr" rid="B18">18</xref>
], or any of 23 other articles in volume 13 of that journal.</p>
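<p>The ambiguity described above can be illustrated by indexing a few records by the (journal, volume, starting page) triple. The records below are invented for illustration only (the labels, series numbers, and page numbers are hypothetical); they simply mimic the two situations described in the text.</p>
<preformat>
```python
from collections import defaultdict

# Hypothetical records: (journal, series, volume, start_page, label)
records = [
    ("Annals and Magazine of Natural History", 1, 1, 5, "article A"),
    ("Annals and Magazine of Natural History", 7, 1, 5, "article B"),
    ("Arkiv för Zoologi", None, 13, 1, "article C"),
    ("Arkiv för Zoologi", None, 13, 1, "article D"),
]

by_triple = defaultdict(list)
by_triple_and_series = defaultdict(list)
for journal, series, volume, spage, label in records:
    by_triple[(journal, volume, spage)].append(label)
    by_triple_and_series[(journal, series, volume, spage)].append(label)

# Triples that point at more than one article cannot identify it alone.
ambiguous = {key: labels for key, labels in by_triple.items() if len(labels) > 1}
```
</preformat>
<p>Adding the series resolves the first case but not the second, where every article in the volume starts on page 1 and further metadata is needed to tell them apart.</p>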
<p>Discovering articles also assumes that the pagination in BHL is complete and correct, and that one side of a sheet of paper corresponds to a "page". BHL records the page number of regular pages, but not pages that are classified as special in some way, such as title pages, or tables of contents. For example, page 1 in Lynch et al. [
<xref ref-type="bibr" rid="B19">19</xref>
] is recorded in BHL as being the title page without any number, which will frustrate efforts to locate this article by starting page alone.</p>
<p>While the triple (journal, volume, starting page) is usually sufficient - subject to the caveats above - to locate the start of an article, we want to recover all the pages in the article, hence we need both the starting and ending pages. Ideally we could then extract the corresponding set of page images from BHL and join them together to form an article. However, it is not uncommon for older articles to have discontinuous physical pagination, for example by having plates inserted between pages in the text. In some publications, such as
<italic>Isis von Oken</italic>
, the text on a page forms two columns, each with its own page number (Figure
<xref ref-type="fig" rid="F2">2</xref>
), hence one physical page need not equate to a bibliographic page.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Physical page with two page numbers</bold>
. Example of a physical page in the journal
<italic>Isis von Oken </italic>
with two columns, each of which has its own page number (249 and 250, respectively).</p>
</caption>
<graphic xlink:href="1471-2105-12-187-2"></graphic>
</fig>
</sec>
<sec>
<title>Metadata matters</title>
<p>Given that locating articles in an archive of legacy literature such as BHL is a non-trivial task, it is worth considering why such an undertaking is worthwhile, beyond integrating BHL with existing citation practices. Indeed, one could argue that, given that the OCR text for BHL content has been indexed by taxonomic name, the need for indexing by article has been greatly reduced - the user could simply search by taxonomic name and find the content they require. This would be sufficient for many users, especially if we were confident that BHL had correctly indexed all the taxonomic names contained in the pages it has scanned. However, OCR errors mean that a significant fraction of names will be missed [
<xref ref-type="bibr" rid="B20">20</xref>
]. An obvious approach to discovering these missing names would be to take existing databases of taxonomic names and publications and search for those publications in BHL.</p>
<p>Metadata also provides ways for clients to aggregate and filter search results. The Encyclopedia of Life [
<xref ref-type="bibr" rid="B21">21</xref>
] incorporates search results from BHL in its taxon pages, but the user has no obvious means of discovering whether the results are from the same article or not, nor can they order the results by date. As an example of one way the display of search results can be improved by sorting, consider the dispute concerning the correct scientific name for the sperm whale, which is debated in both the scientific literature [
<xref ref-type="bibr" rid="B22">22</xref>
-
<xref ref-type="bibr" rid="B24">24</xref>
] and, more vociferously, Wikipedia [
<xref ref-type="bibr" rid="B25">25</xref>
]. Being able to extract basic metadata from BHL would enable us to visualise the relative popularity of the two alternatives,
<italic>Physeter catodon </italic>
and
<italic>Physeter macrocephalus</italic>
, over time (Figure
<xref ref-type="fig" rid="F3">3</xref>
). With the obvious caveat that the literature in BHL is a biased sample of the taxonomic literature, it is clear that
<italic>Physeter macrocephalus </italic>
is the more commonly used name, but its usage peaked around the start of the twentieth century. By the 1950s, the sperm whale was more commonly referred to as
<italic>Physeter catodon</italic>
. Navigating BHL content by date may help the user discover why the relative usage frequency of these two names changed in the previous century.</p>
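<p>The caption of Figure 3 notes that publication dates were taken from the <monospace>StartYear</monospace> and <monospace>EndYear</monospace> fields using regular expressions. A minimal sketch of that kind of extraction follows; the pattern, the function name, and the sample values are assumptions made for illustration and are not the expressions used by BioStor.</p>
<preformat>
```python
import re
from typing import Optional

# Assumed pattern: the first run of four digits that looks like a year
# between 1500 and 2099.
YEAR = re.compile(r"(1[5-9]\d\d|20\d\d)")


def extract_year(raw: Optional[str]) -> Optional[int]:
    """Return the first plausible four-digit year in a free-form field."""
    if not raw:
        return None
    match = YEAR.search(raw)
    return int(match.group(1)) if match else None
```
</preformat>
<p>For example, under these assumptions <monospace>extract_year("c1902-")</monospace> yields 1902, while a value with no year in it yields <monospace>None</monospace> and would be left out of a plot such as Figure 3.</p>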
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Usage of two names for the sperm whale over time</bold>
. Approximate distribution over time of two alternative names for the sperm whale (
<italic>Physeter catodon </italic>
and
<italic>Physeter macrocephalus</italic>
) in items scanned by the Biodiversity Heritage Library. Date of publication was extracted from the
<monospace>StartYear</monospace>
and
<monospace>EndYear</monospace>
fields of the
<monospace>Title</monospace>
table (see Fig. 4) using regular expressions.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-3"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>Construction and content</title>
<p>A local copy of the core BHL tables (Figure
<xref ref-type="fig" rid="F4">4</xref>
) was created in MySQL using the data dump provided by BHL
<ext-link ext-link-type="uri" xlink:href="http://www.biodiversitylibrary.org/data/data.zip">http://www.biodiversitylibrary.org/data/data.zip</ext-link>
. Page images and OCR text for individual pages are retrieved as needed using the BHL API and cached locally (together with a thumbnail of the page image).</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Simplified BHL schema</bold>
. Simplified database schema for the core tables in the Biodiversity Heritage Library. The fields referred to in the text are shown, together with a brief explanation of their contents.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-4"></graphic>
</fig>
<sec>
<title>Locating an article</title>
<p>BioStor provides an OpenURL [
<xref ref-type="bibr" rid="B26">26</xref>
] resolver service to locate articles in BHL. At a minimum the resolver requires the journal name, volume, and starting page of the article being searched for. It may also make use of journal series and date, if these are provided. This service first checks whether the article already exists in the BioStor database. If the article is not found, the algorithm outlined in Figure
<xref ref-type="fig" rid="F5">5</xref>
is used to search for the article in BHL.</p>
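<p>A request to the resolver can be sketched as a query string carrying the journal name, volume, and starting page. The base URL is the one given in the abstract; the query keys used here (<monospace>genre</monospace>, <monospace>title</monospace>, <monospace>volume</monospace>, <monospace>spage</monospace>, <monospace>series</monospace>, <monospace>date</monospace>) follow common OpenURL usage and are an assumption about what the BioStor resolver accepts, and the function name is hypothetical.</p>
<preformat>
```python
from urllib.parse import urlencode

BIOSTOR_OPENURL = "http://biostor.org/openurl/"


def build_openurl(journal: str, volume: str, spage: str,
                  series: str = "", date: str = "") -> str:
    """Build a resolver URL from the minimum metadata named in the text."""
    # Key names are assumed, based on common OpenURL conventions.
    params = {"genre": "article", "title": journal,
              "volume": volume, "spage": spage}
    if series:
        params["series"] = series
    if date:
        params["date"] = date
    return BIOSTOR_OPENURL + "?" + urlencode(params)


url = build_openurl("Arkiv för Zoologi", "13", "1")
```
</preformat>
<p>The optional series and date are only added when supplied, mirroring the description of them as extra hints rather than required fields.</p>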
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Flow chart of algorithm for finding an article in BHL</bold>
. Steps 1-4 are explained in the text.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-5"></graphic>
</fig>
<sec>
<title>Step 1 - Finding the journal</title>
<p>The first step is to determine whether BHL includes the journal containing the article. BioStor uses a service provided by bioGUID [
<xref ref-type="bibr" rid="B27">27</xref>
,
<xref ref-type="bibr" rid="B28">28</xref>
] to find the ISSN [
<xref ref-type="bibr" rid="B29">29</xref>
] for the journal. If the bioGUID service returns an ISSN, the algorithm looks up the ISSN in the
<monospace>Title Identifier</monospace>
table (Figure
<xref ref-type="fig" rid="F1">1</xref>
) and retrieves the corresponding BHL
<monospace>TitleID</monospace>
. If the bioGUID service does not return an ISSN, the algorithm attempts to find the journal title in the
<monospace>ShortTitle</monospace>
field in the
<monospace>Title</monospace>
table using approximate string matching. If it fails to find the title it then searches the
<monospace>VolumeInfo</monospace>
field in the
<monospace>Item</monospace>
table - for some journals (e.g.,
<italic>Fieldiana Zoology</italic>
, ISSN 0015-0754) the journal title is stored in that field. If the journal still cannot be found at this point, the algorithm exits.</p>
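<p>The title lookup by approximate string matching can be sketched as follows. This is a minimal illustration only: the function name, the example rows, and the TitleID values are hypothetical, and Python's difflib similarity ratio stands in for whichever approximate matching method BioStor actually uses.</p>

```python
from difflib import get_close_matches

def find_title_id(journal, short_titles, cutoff=0.8):
    """Return the TitleID whose ShortTitle best matches `journal`, or None."""
    # Compare case-insensitively; keep a map back to the TitleID.
    lookup = {title.lower(): title_id for title, title_id in short_titles.items()}
    matches = get_close_matches(journal.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

# Hypothetical rows: ShortTitle mapped to a made-up TitleID.
SHORT_TITLES = {"Fieldiana Zoology": 1, "Arkiv för Zoologi": 2}

print(find_title_id("Fieldiana Zool", SHORT_TITLES))
print(find_title_id("Nature", SHORT_TITLES))
```

<p>The cutoff of 0.8 is an arbitrary choice for this sketch; a lower value tolerates more abbreviation at the cost of more false matches.</p>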
</sec>
<sec>
<title>Step 2 - Finding scanned items for the journal</title>
<p>Ideally each journal corresponds to a single BHL title, but in some cases the same journal may be represented by more than one BHL title, and hence have more than one
<monospace>TitleID</monospace>
. Step 2 uses a hard-coded table of such cases to ensure that all items for a given journal are considered by Step 3.</p>
</sec>
<sec>
<title>Step 3 - Finding the volume and page</title>
<p>Ideally the
<monospace>VolumeInfo</monospace>
field in the
<monospace>Item</monospace>
table would contain just the volume number; however, all manner of free-form text may be found there. The volume may be recorded as a simple number or as a string, sometimes indicating volume, page, or date ranges, notes on completeness of the volume, or other comments (e.g., "Index"). Metadata may also be in a variety of languages, such that the field may refer to "Volume", "Band", or "Tome". Nor is metadata always recorded consistently within a journal; for example, the
<monospace>VolumeInfo</monospace>
field for scanned items belonging to the journal
<italic>Proceedings of the Zoological Society of London </italic>
contains strings such as:</p>
<p>• Part 1- Part 4 (1833-38)</p>
<p>• 1856</p>
<p>• 1901, v. 1 (Jan.-Apr.)</p>
<p>• Jan-Apr 1906</p>
<p>• 1912 v. 2</p>
<p>• 1923, pt. 1-2 (pp. 1-481)</p>
<p>BioStor uses a set of ad-hoc regular expressions to extract the volume (and, where present, other information such as series, issue, and date) from the
<monospace>VolumeInfo</monospace>
field. If no match to the target volume is found the algorithm exits.</p>
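<p>To illustrate the kind of pattern matching involved, the sketch below parses several of the example strings listed above. The patterns and field names are hypothetical and written only for these examples; they are not BioStor's actual expressions.</p>

```python
import re

# Hypothetical patterns, tried in order; each is paired with the names
# of the fields captured by its groups.
PATTERNS = [
    # "1901, v. 1 (Jan.-Apr.)" or "1912 v. 2": year followed by a volume number
    (re.compile(r"^(\d{4}),?\s+v\.\s*(\d+)"), ("year", "volume")),
    # "1923, pt. 1-2 (pp. 1-481)": year followed by a range of parts
    (re.compile(r"^(\d{4}),?\s+pt\.\s*(\d+)-(\d+)"), ("year", "part_from", "part_to")),
    # "Jan-Apr 1906": month range followed by a year
    (re.compile(r"^[A-Z][a-z]{2}\.?-[A-Z][a-z]{2}\.?\s+(\d{4})$"), ("year",)),
    # "1856": a bare four-digit year
    (re.compile(r"^(\d{4})$"), ("year",)),
]

def parse_volume_info(text):
    """Return a dict of extracted fields, or None if no pattern applies."""
    text = text.strip()
    for pattern, names in PATTERNS:
        match = pattern.search(text)
        if match:
            return {name: int(value) for name, value in zip(names, match.groups())}
    return None

for example in ["1856", "1901, v. 1 (Jan.-Apr.)", "Jan-Apr 1906", "1912 v. 2",
                "1923, pt. 1-2 (pp. 1-481)", "Part 1- Part 4 (1833-38)"]:
    print(example, parse_volume_info(example))
```

<p>The last example matches none of these patterns and returns None, which corresponds to the case where the metadata does not conform to any known expression and the item cannot be interpreted.</p>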
</sec>
<sec>
<title>Step 4 - Checking the match</title>
<p>At this stage in the algorithm we will have one or more candidates for the first page in the article. Multiple candidates may occur because the article has been scanned by more than one BHL contributor, or because there may be more than one article with the same metadata (see examples of
<italic>Annals and Magazine of Natural History </italic>
and
<italic>Arkiv för Zoologi </italic>
discussed above). Some of these matches can be filtered by series or date, if the user has supplied that information. For each remaining match we take the OCR text for the first page in the candidate and compare it to the article title by computing a local alignment between words in the page and words in the title using the Smith-Waterman [
<xref ref-type="bibr" rid="B30">30</xref>
] algorithm. Each pair of words that matches exactly is scored +2; mismatches, deletions, and insertions are all scored -1. The score for the alignment is normalised by dividing it by the match score × the number of words in the title, so that a perfect match has a score of 1. As an illustration, Figure
<xref ref-type="fig" rid="F6">6</xref>
shows the distribution of alignment scores for the
<italic>Annals and Magazine of Natural History</italic>
. Most articles in this journal have a score > 0.5; however, some articles have very low scores due to poor OCR quality. For example, for the article "Preliminary notice of the Schizopoda collected by H. M.S. Discovery in the Antarctic region" [
<xref ref-type="bibr" rid="B31">31</xref>
] the corresponding OCR text is "Preltiniiiari/Xutice of I he Sc/ti:oj/0(/a collcxted hy 11. M.S. 'Dixcovenj' in the Antarctic Rec/io".</p>
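<p>The scoring scheme described above can be sketched as a word-level Smith-Waterman alignment. The match and penalty values and the normalisation are taken from the text; the function name, the lower-casing, and the whitespace tokenisation are assumptions made for this illustration.</p>

```python
MATCH = 2      # score for a pair of words that match exactly
PENALTY = -1   # score for a mismatch, deletion, or insertion

def alignment_score(title, page_text):
    """Normalised local-alignment score between title words and page words."""
    title_words = title.lower().split()
    page_words = page_text.lower().split()
    if not title_words:
        return 0.0
    best = 0
    # previous[j]: best score of a local alignment ending at page word j
    # for the preceding title word.
    previous = [0] * (len(page_words) + 1)
    for t_word in title_words:
        current = [0]
        for j, p_word in enumerate(page_words, start=1):
            diagonal = previous[j - 1] + (MATCH if t_word == p_word else PENALTY)
            up = previous[j] + PENALTY        # title word left unaligned
            left = current[j - 1] + PENALTY   # page word left unaligned
            cell = max(0, diagonal, up, left)  # local alignment never drops below 0
            current.append(cell)
            best = max(best, cell)
        previous = current
    # Divide by the score of a perfect match so that the maximum is 1.
    return best / (MATCH * len(title_words))

print(alignment_score("on the arachnida of africa",
                      "notes on the arachnida of africa by a collector"))
print(alignment_score("on the arachnida of africa",
                      "on the arachnidu of africa"))
```

<p>A title that occurs verbatim in the page text scores 1, while a single misread word in a five-word title lowers the score to 0.7, which shows how OCR errors depress the scores.</p>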
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Alignment scores for Annals and Magazine of Natural History</bold>
. Frequency distribution of scores for Smith-Waterman alignment between article title and OCR text for 314 articles from
<italic>Annals and Magazine of Natural History </italic>
in the Biodiversity Heritage Library.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-6"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>Storing articles</title>
<p>Articles extracted from BHL are stored in the same MySQL database that stores the BHL tables, using a simple schema comprising a table for article bibliographic metadata, a table for authors, and a table that joins the authors to the individual articles they have authored. A further table joins the article to the BHL Page table (Figure
<xref ref-type="fig" rid="F7">7</xref>
).</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Simplified BioStor database schema</bold>
. Simplified database schema for the core tables in the BioStor database.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-7"></graphic>
</fig>
</sec>
</sec>
<sec>
<title>Utility and Discussion</title>
<p>The BioStor database is available at
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/">http://biostor.org/</ext-link>
. It features an OpenURL resolver, and can display individual articles, lists of publications by author, by taxonomic name, and by journal. At the time of writing the database contains 26,784 articles extracted from BHL.</p>
<sec>
<title>OpenURL resolver</title>
<p>BioStor provides an OpenURL resolver at
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/openurl/">http://biostor.org/openurl/</ext-link>
. If accessed using a web browser the user is presented with a form where they can enter the bibliographic details of an article individually (Figure
<xref ref-type="fig" rid="F8">8a</xref>
), or paste in a full citation and have BioStor attempt to parse it. BioStor's article parser uses regular expressions and is limited to simple citations of the form <(Year)>
. . : -. If the article is already in the BioStor database it is displayed; if not, BioStor attempts to locate the article in BHL. If it finds potential matches, these are displayed to the user (Figure
<xref ref-type="fig" rid="F8">8b</xref>
). For each match the page displays the score based on Smith-Waterman alignment between the page OCR text and the article title. In the example shown in Figure
<xref ref-type="fig" rid="F8">8b</xref>
, there are three potential matches, two of which have high scores (they are duplicates resulting from two BHL contributors having scanned the same journal). A thumbnail of the first page in each possible match is shown; the user can click on this to view a larger version of the page if they wish to inspect the match more closely. If they are happy that one of the matches is indeed the article they were looking for, the user can fill in the reCAPTCHA test [
<xref ref-type="bibr" rid="B32">32</xref>
,
<xref ref-type="bibr" rid="B33">33</xref>
] and click on the corresponding button. BioStor will then retrieve the remaining page images and OCR text from BHL, store the article in its database, then display it to the user.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption>
<p>
<bold>BioStor OpenURL resolver</bold>
. (a) Example of using the web interface to the OpenURL resolver. The user has entered bibliographic details for the reference "On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall" [
<xref ref-type="bibr" rid="B53">53</xref>
]. (b) The resolver has found three possible matches in the Biodiversity Heritage Library. For each match the best alignment between the article title and the OCR text is highlighted in yellow. The user can then choose which match will be stored in BioStor.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-8"></graphic>
</fig>
<p>Cutting and pasting bibliographic details into web forms is tedious, so the web interface to the OpenURL resolver is intended for casual use only. Instead, it is envisaged that users will interact with the OpenURL resolver using one of the bibliographic tools that support the protocol, such as EndNote [
<xref ref-type="bibr" rid="B34">34</xref>
] and Zotero [
<xref ref-type="bibr" rid="B35">35</xref>
], or a web browser that supports OpenURL ContextObjects in Spans (COinS) [
<xref ref-type="bibr" rid="B36">36</xref>
], such as Firefox with the OpenURL Referrer add-on [
<xref ref-type="bibr" rid="B37">37</xref>
]. For example, the following OpenURL corresponds to the web form shown in Figure
<xref ref-type="fig" rid="F8">8a</xref>
(with line breaks added for clarity):</p>
<p>http://biostor.org/openurl</p>
<p>?genre=article</p>
<p>&atitle=On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall</p>
<p>&title=Ann. Mag. nat. Hist.</p>
<p>&volume=1</p>
<p>&spage=308</p>
<p>&epage=321</p>
<p>&date=1898</p>
<p>Appending "&format=json" to the OpenURL returns the result in JavaScript Object Notation (JSON), hence the service can be used as an API by other developers.</p>
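<p>As a minimal sketch of how a client might assemble such a query, the following builds the OpenURL shown above with proper percent-encoding. The parameter names and resolver address come from the example; the function name and its signature are hypothetical, and no request is actually sent.</p>

```python
from urllib.parse import urlencode

BASE_URL = "http://biostor.org/openurl"  # resolver address given in the text

def build_openurl(atitle, title, volume, spage, epage=None, date=None, as_json=False):
    """Assemble an OpenURL query string for a journal article."""
    params = {
        "genre": "article",
        "atitle": atitle,   # article title
        "title": title,     # journal name
        "volume": volume,
        "spage": spage,     # starting page
    }
    if epage is not None:
        params["epage"] = epage
    if date is not None:
        params["date"] = date
    if as_json:
        params["format"] = "json"  # ask for a JSON response
    return BASE_URL + "?" + urlencode(params)

url = build_openurl(
    atitle="On the Arachnida taken in the Transvaal and in Nyasaland "
           "by Mr W. L. Distant and Dr Percy Rendall",
    title="Ann. Mag. nat. Hist.",
    volume=1, spage=308, epage=321, date=1898, as_json=True,
)
print(url)
```

<p>The resulting string could then be fetched with any HTTP client; the structure of the JSON response is not described here, so the sketch stops at building the URL.</p>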
</sec>
<sec>
<title>Retrieval performance</title>
<p>The ability of BioStor to find articles in BHL depends on several factors. An obvious reason BioStor may fail to find an article is that it simply has not been scanned by BHL. Alternatively, it may have been scanned by BHL but not yet added to the local copy of BHL used by BioStor. Even if an article exists in BHL, BioStor may fail to find it if the metadata describing the item that contains the article does not conform to one of the regular expressions BioStor uses to interpret the
<monospace>VolumeInfo</monospace>
field in the
<monospace>Item</monospace>
table. Because BioStor evaluates the quality of a match by comparing the title of the target article with the OCR text (Figure
<xref ref-type="fig" rid="F6">6</xref>
), OCR errors may result in the match being deemed too poor to be correct. If the metadata for the target article contains significant errors, such as incorrect pagination, then BioStor may also fail to find an article.</p>
<sec>
<title>Retrieval of articles in the journal Tijdschrift voor Entomologie</title>
<p>To provide a benchmark for BioStor's performance I used an EndNote database of 2330 articles from the journal
<italic>Tijdschrift voor Entomologie </italic>
spanning the years 1858 to 1999, inclusive, assembled by E. J. van Nieukerken as part of a complete index of the journal [
<xref ref-type="bibr" rid="B38">38</xref>
]. Almost all volumes of
<italic>Tijdschrift voor Entomologie </italic>
for this period have been scanned by BHL, so ideally BioStor should recover most, if not all, of the articles from this journal. This database was chosen because of the quality of the bibliographic metadata, and the fact that it spanned some 150 years, during which time the typeface and layout of the journal changed significantly.</p>
<p>The EndNote file for
<italic>Tijdschrift voor Entomologie </italic>
was converted into a Research Information Systems (RIS) format file, which was then parsed by a script that extracted each article, constructed an OpenURL query, and forwarded it to BioStor, which returned a response in JSON format. The script recorded whether a match for each article was found, ignoring matches with an alignment score of less than 0.5. As part of the output the script created web pages displaying details of each putative match, including a thumbnail image of the first page of the article, making it possible to quickly evaluate whether the match was correct. The database, scripts, and HTML output are available from
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/ms/">http://biostor.org/ms/</ext-link>
.</p>
<p>Of the 2330 articles in the database, 94 articles are in volumes not presently available in BHL, and 224 articles have pages labelled with Roman numerals that were not recorded by BHL. This left 2012 articles in the BHL archive, of which BioStor found matches for 1429 (71%), doing noticeably better for articles published after 1950 (Figure
<xref ref-type="fig" rid="F9">9</xref>
). Only fifteen matches (1%) were found to be incorrect, in each case due to pagination errors in the corresponding scanned items in BHL (typically the pagination recorded by BHL was offset from the correct pagination by 2-3 pages).</p>
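<p>The figures quoted above follow from simple arithmetic, reproduced here as a short check. The counts are those given in the text; the variable names are chosen for this illustration.</p>

```python
total_articles = 2330        # articles in the EndNote database
not_in_bhl = 94              # in volumes not available in BHL
roman_numeral_pages = 224    # page labels not recorded by BHL
found = 1429                 # articles for which BioStor found a match
incorrect = 15               # matches judged to be wrong

searchable = total_articles - not_in_bhl - roman_numeral_pages
found_fraction = found / searchable
incorrect_fraction = incorrect / found

print(f"{searchable} articles could in principle be located")
print(f"{found_fraction:.0%} of these were found")
print(f"{incorrect_fraction:.0%} of the matches were incorrect")
```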
<fig id="F9" position="float">
<label>Figure 9</label>
<caption>
<p>
<bold>Success in locating articles from the journal Tijdschrift voor Entomologie</bold>
. Percentage of articles in the journal
<italic>Tijdschrift voor Entomologie </italic>
for the years 1858-1999 that BioStor found in the Biodiversity Heritage Library (BHL). 0% values represent volumes of
<italic>Tijdschrift voor Entomologie </italic>
that have not been scanned by BHL.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-9"></graphic>
</fig>
<p>
<italic>Tijdschrift voor Entomologie </italic>
is just one of the journals scanned by BHL, and it would be desirable to evaluate BioStor's performance across a range of journals. However, at present evaluation is hampered by the lack of freely available, comprehensive bibliographic databases for taxonomic journals.</p>
</sec>
</sec>
<sec>
<title>Displaying articles</title>
<p>Articles found by the OpenURL resolver are stored in the BioStor database, and given a unique URL of
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/n">http://biostor.org/reference/n</ext-link>
where
<italic>n </italic>
is a unique integer. Figure
<xref ref-type="fig" rid="F10">10</xref>
shows an article [
<xref ref-type="bibr" rid="B39">39</xref>
] being displayed in BioStor. A simple JavaScript-based viewer displays a single page as an image, with thumbnails of all the pages in the article shown in a scrolling list. To minimise the time the article page takes to load, the thumbnails are only loaded when visible, using a delayed JavaScript image loader [
<xref ref-type="bibr" rid="B40">40</xref>
]. The user can navigate through the article by clicking on the thumbnail for a given page. To smooth the transition between individual pages, when the user clicks on the thumbnail for a new page the thumbnail is displayed in place of the full page image while that page image loads. When the page image has loaded the low resolution thumbnail (which will appear fuzzy to the user) is replaced by the higher resolution image, giving the user the sensation that the page has come into focus.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption>
<p>
<bold>Example of page displaying an article in BioStor</bold>
. The article being displayed is [
<xref ref-type="bibr" rid="B39">39</xref>
].</p>
</caption>
<graphic xlink:href="1471-2105-12-187-10"></graphic>
</fig>
<p>The metadata (such as title, authors, and journal name) can all be edited by the user. These edits will be saved if the user passes a reCAPTCHA test. The metadata can be retrieved in standard formats such as Reference Manager (RIS), EndNote XML, and BibTeX. The web page also contains bibliographic metadata embedded using the ContextObjects in Spans (COinS) technique [
<xref ref-type="bibr" rid="B36">36</xref>
], and tags using the Dublin Core [
<xref ref-type="bibr" rid="B41">41</xref>
] and Google Scholar [
<xref ref-type="bibr" rid="B11">11</xref>
] vocabularies. The article itself can also be downloaded as a PDF file, with bibliographic metadata embedded using Adobe's Extensible Metadata Platform (XMP) [
<xref ref-type="bibr" rid="B42">42</xref>
]. Desktop bibliographic software that can read XMP, such as Mendeley [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B43">43</xref>
] and Papers [
<xref ref-type="bibr" rid="B44">44</xref>
], can extract this metadata so that the user need not manually re-enter bibliographic details for the paper.</p>
<p>The article page also displays the taxonomic and, where possible, geographic scope of the article. Taxonomic scope is represented by a tag cloud of the taxonomic names that BHL has found in the OCR text for the article, and by a taxonomic classification of those names based on the 2008 edition of the Catalogue of Life [
<xref ref-type="bibr" rid="B45">45</xref>
]. When an article is added to the BioStor database the OCR text is searched for strings that represent latitude and longitude values for point locations. Any points found are displayed on a Google Map.</p>
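<p>The search for point localities can be illustrated with a single regular expression. The notation handled here (degrees and minutes followed by a hemisphere letter) is an assumption made for this sketch; the text does not specify which coordinate formats BioStor recognises, and the names below are hypothetical.</p>

```python
import re

# Assumed notation: degrees and minutes with a hemisphere letter,
# for example 12°30'S, 45°15'E. Other notations are not handled.
POINT = re.compile(
    r"(\d{1,2})°\s*(\d{1,2})'\s*([NS])[,;\s]+(\d{1,3})°\s*(\d{1,2})'\s*([EW])"
)

def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to signed decimal degrees."""
    value = int(degrees) + int(minutes) / 60
    # Southern and western hemispheres are negative by convention.
    return -value if hemisphere in ("S", "W") else value

def find_points(ocr_text):
    """Return (latitude, longitude) pairs found in the text, in decimal degrees."""
    points = []
    for lat_d, lat_m, lat_h, lon_d, lon_m, lon_h in POINT.findall(ocr_text):
        points.append((to_decimal(lat_d, lat_m, lat_h),
                       to_decimal(lon_d, lon_m, lon_h)))
    return points

print(find_points("Type locality: 12°30'S, 45°15'E, under stones."))
```

<p>In practice OCR errors in the degree and minute symbols would make many coordinates unrecognisable to a pattern this strict.</p>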
</sec>
<sec>
<title>Displaying authors</title>
<p>BioStor displays a summary page for each author in the database. To mitigate the problem of an author having more than one spelling of their name, BioStor clusters names using a web service provided by bioGUID [
<xref ref-type="bibr" rid="B27">27</xref>
], which implements Feitelson's [
<xref ref-type="bibr" rid="B46">46</xref>
] weighted clique algorithm for finding equivalent names. The summary page aggregates publications and coauthorships across this set of names. The page uses Exhibit [
<xref ref-type="bibr" rid="B47">47</xref>
] to create a faceted browser, enabling the user to browse an author's publications by date, journal, and coauthors.</p>
</sec>
<sec>
<title>Displaying journals</title>
<p>By default BioStor uses the ISSN to identify journals. Where an ISSN is not available BioStor uses an OCLC number from the WorldCat service [
<xref ref-type="bibr" rid="B48">48</xref>
]. A user can see all the articles for a given journal by appending the journal's ISSN to the URL http://biostor.org/issn/ (or OCLC to the URL http://biostor.org/oclc/). The resulting web page lists the articles for that journal, as well as a graphical representation of how many articles for that journal have been located in BHL. Figure
<xref ref-type="fig" rid="F11">11</xref>
shows the coverage of the journal
<italic>Proceedings of the United States National Museum </italic>
(ISSN 0096-3801), published from 1878 to 1968.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption>
<p>
<bold>Summary of coverage of the journal Proceedings of the United States National Museum in BioStor</bold>
. Dark blue bars represent pages that have been assigned to an article in BioStor. A sparkline depicts the distribution of these articles over time.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-11"></graphic>
</fig>
</sec>
<sec>
<title>Displaying taxonomic names</title>
<p>If the user clicks on a name in the taxonomic tag cloud (Figure
<xref ref-type="fig" rid="F10">10</xref>
), or appends a taxonomic name (or uBio NameBankID [
<xref ref-type="bibr" rid="B49">49</xref>
]) to the URL http://bioguid.org/name/ for a name that has been taxonomically indexed by BHL, BioStor displays a web page listing the articles in BioStor that contain that name. The page also displays a sparkline showing the distribution of that name over time in the local copy of BHL, and lists taxonomic synonyms of the name according to the 2008 edition of the Catalogue of Life [
<xref ref-type="bibr" rid="B45">45</xref>
].</p>
</sec>
<sec>
<title>Searching and browsing</title>
<p>BioStor supports rudimentary full text search of author names and article titles. It also provides an interactive way to browse articles geographically using Google Maps
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/maps/">http://biostor.org/maps/</ext-link>
(Figure
<xref ref-type="fig" rid="F12">12</xref>
). When the user pans or zooms the map the web page displays the set of articles (up to a limit of 20) whose OCR text includes (latitude, longitude) pairs contained within the current bounds of the map.</p>
<fig id="F12" position="float">
<label>Figure 12</label>
<caption>
<p>
<bold>Browsing BioStor content geographically using Google Maps</bold>
. Listed below the map are the articles in the BioStor database with localities contained within the geographic area being displayed in the map.</p>
</caption>
<graphic xlink:href="1471-2105-12-187-12"></graphic>
</fig>
</sec>
<sec>
<title>Future directions</title>
<p>BioStor locates articles by matching existing bibliographies to BHL content, hence it relies on external sources of metadata to find articles. Typically these are bibliographies assembled by individual taxonomists for particular taxonomic groups, or lists of articles published in a single journal. An alternative approach would be to extract articles directly from the archive. Lu et al. [
<xref ref-type="bibr" rid="B50">50</xref>
] used feature extraction and a mixture of rule-based and machine-learning techniques to extract metadata from BHL OCR text, recovering between 66% and 94% of articles in a selection of three journals. The set of articles in BioStor could be used as a training data set to help further develop these methods. Another approach to article extraction is crowd sourcing, where the task of identifying articles would be devolved to users. Ultimately, crowd sourcing could become important in cleaning metadata, but it may prove challenging to engage users in creating metadata from scratch.</p>
<p>The BHL archive has extracted taxonomic names from the OCR text, and BioStor looks for geographic localities encoded as latitude and longitude pairs. We could make more extensive use of the OCR text, for example by using autonomous citation indexing [
<xref ref-type="bibr" rid="B51">51</xref>
] to extract citations from the literature cited section of each article. These citations could in turn be fed into the BioStor OpenURL resolver to attempt to locate them in BHL. The combination of variable citation styles and OCR errors means that the same reference may be represented by several different citations, requiring tools for cleaning and merging citation data (e.g., [
<xref ref-type="bibr" rid="B52">52</xref>
]).</p>
<p>BioStor is built as a service on top of a copy of data from BHL, and creates a local bibliographic database of articles. One future direction would be to integrate this data with BHL itself. BHL has an OpenURL resolver
<ext-link ext-link-type="uri" xlink:href="http://www.biodiversitylibrary.org/openurlhelp.aspx">http://www.biodiversitylibrary.org/openurlhelp.aspx</ext-link>
that primarily supports books rather than articles. Adding metadata from BioStor could enhance the BHL OpenURL service, and provide the biodiversity community with a single source for BHL-derived content. BioStor content could also be added to other bibliographic databases, in particular Mendeley [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B43">43</xref>
]. Mendeley is developing an API for storing and retrieving documents and associated metadata, hence it might be possible to devolve the storing of basic bibliographic metadata to Mendeley, BioStor then becoming simply an OpenURL resolver.</p>
</sec>
</sec>
<sec>
<title>Conclusions</title>
<p>The 31 million scanned pages made available by the Biodiversity Heritage Library (BHL) represent a substantial resource of biological literature. BioStor provides an OpenURL resolver to locate articles in this archive. Each article extracted from BHL is given a unique URL, corresponding to a web page that displays the article pages, and information about the taxonomic names and geographic localities mentioned in the article. BioStor is available at
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/">http://biostor.org/</ext-link>
.</p>
</sec>
<sec>
<title>Availability and requirements</title>
<p>
<bold>Project Name: </bold>
BioStor</p>
<p>
<bold>Project Home Page: </bold>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/">http://biostor.org/</ext-link>
. Source code is available from
<ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/bioguid/source/browse/#svn/trunk/biostor">http://code.google.com/p/bioguid/source/browse/#svn/trunk/biostor</ext-link>
.</p>
<p>
<bold>Operating System: </bold>
The BioStor web site is usable with any modern web browser. The source code can be easily installed on a Mac OS X or Linux server. It has not been tested on a Windows machine.</p>
<p>
<bold>Programming Language: </bold>
PHP</p>
<p>
<bold>Other Requirements: </bold>
Web server</p>
<p>
<bold>License: </bold>
GNU General Public License version 2</p>
<p>
<bold>Any restrictions to use by non-academics: </bold>
None</p>
</sec>
<sec>
<title>Abbreviations</title>
<p>API: Application Programming Interface; BHL: Biodiversity Heritage Library; DOI: Digital Object Identifier; ISSN: International Standard Serial Number; JSON: JavaScript Object Notation; OCR: Optical Character Recognition; URL: Uniform Resource Locator.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The author declares that they have no competing interests.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>The core data for BioStor comes from the Biodiversity Heritage Library [
<xref ref-type="bibr" rid="B7">7</xref>
]. Chris Freeland, Phil Cryer, and Mike Lichtenberg provided data dumps from BHL, and answered queries regarding the BHL database schema. E. J. van Nieukerken kindly provided the EndNote database for
<italic>Tijdschrift voor Entomologie</italic>
. I thank the anonymous referees for their comments.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Lambert</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Bianucci</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Post</surname>
<given-names>K</given-names>
</name>
<name>
<surname>de Muizon</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Salas-Gismondi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urbina</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Reumer</surname>
<given-names>J</given-names>
</name>
<article-title>The giant bite of a new raptorial sperm whale from the Miocene epoch of Peru</article-title>
<source>Nature</source>
<year>2010</year>
<volume>466</volume>
<issue>7302</issue>
<fpage>105</fpage>
<lpage>108</lpage>
<pub-id pub-id-type="doi">10.1038/nature09067</pub-id>
<pub-id pub-id-type="pmid">20596020</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="book">
<name>
<surname>Melville</surname>
<given-names>H</given-names>
</name>
<source>Moby-Dick</source>
<year>1851</year>
<publisher-name>Richard Bentley, London</publisher-name>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="book">
<collab>International Commission on Zoological Nomenclature</collab>
<source>International code of zoological nomenclature. International Trust for Zoological Nomenclature</source>
<year>1999</year>
<edition>4</edition>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="book">
<name>
<surname>Koch</surname>
<given-names>AC</given-names>
</name>
<source>Description of the Missourium, or Missouri Leviathan: together with its supposed habits and Indian traditions concerning the location from whence it was exhumed; also, comparisons of the whale, crocodile and missourium with the leviathan, as described in 41st chapter of the book of Job</source>
<year>1841</year>
<edition>2</edition>
<publisher-name>Prentice and Weissinger</publisher-name>
<ext-link ext-link-type="uri" xlink:href="http://www.biodiversitylibrary.org/item/81522">http://www.biodiversitylibrary.org/item/81522</ext-link>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Lambert</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Bianucci</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Post</surname>
<given-names>K</given-names>
</name>
<name>
<surname>de Muizon</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Salas-Gismondi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urbina</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Reumer</surname>
<given-names>J</given-names>
</name>
<article-title>The giant bite of a new raptorial sperm whale from the Miocene epoch of Peru</article-title>
<source>Nature</source>
<year>2010</year>
<volume>466</volume>
<issue>7310</issue>
<fpage>1134</fpage>
<pub-id pub-id-type="doi">10.1038/nature09381</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<collab>Anonymous</collab>
<article-title>The legacy of Linnaeus</article-title>
<source>Nature</source>
<year>2007</year>
<volume>446</volume>
<fpage>231</fpage>
<lpage>232</lpage>
<pub-id pub-id-type="pmid">17361138</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="other">
<article-title>Biodiversity Heritage Library</article-title>
<ext-link ext-link-type="uri" xlink:href="http://biodiversitylibrary.org">http://biodiversitylibrary.org</ext-link>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<name>
<surname>Pilsk</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Person</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Deveer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Furfey</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kalfatovic</surname>
<given-names>M</given-names>
</name>
<article-title>The Biodiversity Heritage Library: Advancing Metadata Practices in a Collaborative Digital Library</article-title>
<source>Journal of Library Metadata</source>
<year>2010</year>
<volume>10</volume>
<issue>2</issue>
<fpage>136</fpage>
<lpage>155</lpage>
<pub-id pub-id-type="doi">10.1080/19386389.2010.506400</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="other">
<article-title>Internet Archive</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.archive.org/">http://www.archive.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="other">
<article-title>PubMed</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/">http://www.ncbi.nlm.nih.gov/pubmed/</ext-link>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="other">
<article-title>Google Scholar</article-title>
<ext-link ext-link-type="uri" xlink:href="http://scholar.google.com/">http://scholar.google.com/</ext-link>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="other">
<name>
<surname>Cameron</surname>
<given-names>RD</given-names>
</name>
<article-title>Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention</article-title>
<source>Tech. Rep. CMPT TR 1998-08, School of Computing Science, Simon Fraser University</source>
<year>1998</year>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="other">
<article-title>CrossRef OpenURL</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.crossref.org/openurl">http://www.crossref.org/openurl</ext-link>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="other">
<article-title>The Digital Object Identifier System</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.doi.org/">http://www.doi.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="other">
<article-title>Mendeley</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.mendeley.com/">http://www.mendeley.com/</ext-link>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Evenhuis</surname>
<given-names>NL</given-names>
</name>
<article-title>Publication and dating of the journals forming the
<italic>Annals and Magazine of Natural History </italic>
and the
<italic>Journal of Natural History</italic>
</article-title>
<source>Zootaxa</source>
<year>2003</year>
<volume>385</volume>
<fpage>1</fpage>
<lpage>68</lpage>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Alexander</surname>
<given-names>CP</given-names>
</name>
<article-title>The crane-flies collected by the Swedish expedition (1895-1896) to southern Chile and Tierra del Fuego (Tipulidae, Diptera)</article-title>
<source>Arkiv för Zoologi</source>
<year>1920</year>
<volume>13</volume>
<issue>6</issue>
<fpage>1</fpage>
<lpage>32</lpage>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/13820">http://biostor.org/reference/13820</ext-link>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<name>
<surname>Michaelsen</surname>
<given-names>W</given-names>
</name>
<article-title>Neue und wenig bekannte Oligochäten aus skandinavischen Sammlungen</article-title>
<source>Arkiv för Zoologi</source>
<year>1921</year>
<volume>13</volume>
<issue>19</issue>
<fpage>1</fpage>
<lpage>25</lpage>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/14784">http://biostor.org/reference/14784</ext-link>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Lynch</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Ruíz-Carranza</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Ardila-Robayo</surname>
<given-names>MC</given-names>
</name>
<article-title>The identities of the Colombian frogs confused with
<italic>Eleutherodactylus latidiscus </italic>
(Boulenger) (Amphibia: Anura: Leptodactylidae)</article-title>
<source>Occasional Papers of the Museum of Natural History University of Kansas</source>
<year>1994</year>
<volume>170</volume>
<fpage>1</fpage>
<lpage>42</lpage>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/228">http://biostor.org/reference/228</ext-link>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="other">
<name>
<surname>Wei</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Heidorn</surname>
<given-names>PB</given-names>
</name>
<name>
<surname>Freeland</surname>
<given-names>C</given-names>
</name>
<article-title>Name Matters: Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library (BHL)</article-title>
<source>iConference 2010 Proceedings</source>
<year>2010</year>
<fpage>284</fpage>
<lpage>288</lpage>
<ext-link ext-link-type="uri" xlink:href="http://hdl.handle.net/2142/14919">http://hdl.handle.net/2142/14919</ext-link>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="other">
<article-title>Encyclopedia of Life</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.eol.org/">http://www.eol.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>Holthuis</surname>
<given-names>LB</given-names>
</name>
<article-title>The Scientific Name of the Sperm Whale</article-title>
<source>Marine Mammal Science</source>
<year>1987</year>
<volume>3</volume>
<fpage>87</fpage>
<lpage>89</lpage>
<pub-id pub-id-type="doi">10.1111/j.1748-7692.1987.tb00154.x</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Schevill</surname>
<given-names>WE</given-names>
</name>
<article-title>Mr. Schevill replies</article-title>
<source>Marine Mammal Science</source>
<year>1987</year>
<volume>3</volume>
<fpage>89</fpage>
<lpage>90</lpage>
<pub-id pub-id-type="doi">10.1111/j.1748-7692.1987.tb00155.x</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="journal">
<name>
<surname>Schevill</surname>
<given-names>WE</given-names>
</name>
<article-title>The International Code of Zoological Nomenclature and a paradigm: the name
<italic>Physeter catodon </italic>
Linnaeus 1758</article-title>
<source>Marine Mammal Science</source>
<year>1986</year>
<volume>2</volume>
<issue>2</issue>
<fpage>153</fpage>
<lpage>157</lpage>
<pub-id pub-id-type="doi">10.1111/j.1748-7692.1986.tb00036.x</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Page</surname>
<given-names>RDM</given-names>
</name>
<article-title>Wikipedia as an encyclopaedia of life</article-title>
<source>Organisms Diversity and Evolution</source>
<year>2010</year>
<volume>10</volume>
<issue>4</issue>
<fpage>343</fpage>
<lpage>349</lpage>
<pub-id pub-id-type="doi">10.1007/s13127-010-0028-9</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Van de Sompel</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Beit-Arie</surname>
<given-names>O</given-names>
</name>
<article-title>Open Linking in the Scholarly Information Environment Using the OpenURL Framework</article-title>
<source>D-Lib Magazine</source>
<year>2001</year>
<volume>7</volume>
<issue>3</issue>
<pub-id pub-id-type="doi">10.1045/march2001-vandesompel</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="journal">
<name>
<surname>Page</surname>
<given-names>RDM</given-names>
</name>
<article-title>bioGUID: resolving, discovering, and minting identifiers for biodiversity informatics</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>Suppl 14</issue>
<fpage>S5</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S5</pub-id>
<pub-id pub-id-type="pmid">19900301</pub-id>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="other">
<article-title>bioGUID</article-title>
<ext-link ext-link-type="uri" xlink:href="http://bioguid.info/">http://bioguid.info/</ext-link>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="other">
<article-title>ISSN International Centre</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.issn.org">http://www.issn.org</ext-link>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Smith</surname>
<given-names>TF</given-names>
</name>
<name>
<surname>Waterman</surname>
<given-names>MS</given-names>
</name>
<article-title>Identification of common molecular subsequences</article-title>
<source>Journal of Molecular Biology</source>
<year>1981</year>
<volume>147</volume>
<fpage>195</fpage>
<lpage>197</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(81)90087-5</pub-id>
<pub-id pub-id-type="pmid">7265238</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="journal">
<name>
<surname>Holt</surname>
<given-names>EWL</given-names>
</name>
<name>
<surname>Tattersall</surname>
<given-names>WM</given-names>
</name>
<article-title>Preliminary notice of the Schizopoda collected by H.M.S. Discovery in the Antarctic region</article-title>
<source>Ann Mag Nat Hist</source>
<year>1906</year>
<volume>17</volume>
<fpage>1</fpage>
<lpage>11</lpage>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/50163">http://biostor.org/reference/50163</ext-link>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="other">
<article-title>reCAPTCHA</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.google.com/recaptcha">http://www.google.com/recaptcha</ext-link>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>von Ahn</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Maurer</surname>
<given-names>B</given-names>
</name>
<name>
<surname>McMillen</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Abraham</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Blum</surname>
<given-names>M</given-names>
</name>
<article-title>reCAPTCHA: Human-Based Character Recognition via Web Security Measures</article-title>
<source>Science</source>
<year>2008</year>
<volume>321</volume>
<issue>5895</issue>
<fpage>1465</fpage>
<lpage>1468</lpage>
<pub-id pub-id-type="doi">10.1126/science.1160379</pub-id>
<pub-id pub-id-type="pmid">18703711</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="other">
<article-title>EndNote</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.endnote.com/">http://www.endnote.com/</ext-link>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="other">
<article-title>Zotero</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.zotero.org/">http://www.zotero.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="other">
<article-title>OpenURL ContextObject in SPAN (COinS)</article-title>
<ext-link ext-link-type="uri" xlink:href="http://ocoins.info/">http://ocoins.info/</ext-link>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="other">
<article-title>OpenURL Referrer</article-title>
<ext-link ext-link-type="uri" xlink:href="https://addons.mozilla.org/en-US/firefox/addon/4150">https://addons.mozilla.org/en-US/firefox/addon/4150</ext-link>
</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="journal">
<name>
<surname>van Nieukerken</surname>
<given-names>EJ</given-names>
</name>
<article-title>Tijdschrift voor Entomologie 150 volumes: one and a half century of Systematic Entomology in a changing world</article-title>
<source>Tijdschrift voor Entomologie</source>
<year>2007</year>
<volume>1</volume>
<issue>2</issue>
<fpage>245</fpage>
<lpage>261</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.repository.naturalis.nl/document/93299">http://www.repository.naturalis.nl/document/93299</ext-link>
</mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="journal">
<name>
<surname>Raselimanana</surname>
<given-names>AP</given-names>
</name>
<name>
<surname>Raxworthy</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Nussbaum</surname>
<given-names>RA</given-names>
</name>
<article-title>A revision of the dwarf
<italic>Zonosaurus </italic>
Boulenger (Reptilia: Squamata: Cordylidae) from Madagascar, including descriptions of three new species</article-title>
<source>Scientific Papers Natural History Museum University of Kansas</source>
<year>2000</year>
<volume>18</volume>
<fpage>1</fpage>
<lpage>16</lpage>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/50335">http://biostor.org/reference/50335</ext-link>
</mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="other">
<article-title>lazierLoad - Javascript Image Lazy Loader for Prototype</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.bram.us/projects/js_bramus/lazierload/">http://www.bram.us/projects/js_bramus/lazierload/</ext-link>
</mixed-citation>
</ref>
<ref id="B41">
<mixed-citation publication-type="other">
<article-title>Dublin Core Metadata Initiative</article-title>
<ext-link ext-link-type="uri" xlink:href="http://dublincore.org/">http://dublincore.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="other">
<article-title>Adobe XMP</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.adobe.com/products/xmp/index.html">http://www.adobe.com/products/xmp/index.html</ext-link>
</mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="other">
<name>
<surname>Henning</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Reichelt</surname>
<given-names>J</given-names>
</name>
<article-title>Mendeley - A Last.fm For Research?</article-title>
<source>eScience '08. IEEE Fourth International Conference on eScience, 2008</source>
<year>2008</year>
<fpage>327</fpage>
<lpage>328</lpage>
<pub-id pub-id-type="pmid">21565935</pub-id>
</mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="other">
<article-title>Papers</article-title>
<ext-link ext-link-type="uri" xlink:href="http://mekentosj.com/papers/">http://mekentosj.com/papers/</ext-link>
</mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="other">
<article-title>The Species 2000 and ITIS Catalogue of Life</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.catalogueoflife.org">http://www.catalogueoflife.org</ext-link>
</mixed-citation>
</ref>
<ref id="B46">
<mixed-citation publication-type="journal">
<name>
<surname>Feitelson</surname>
<given-names>DG</given-names>
</name>
<article-title>On identifying name equivalences in digital libraries</article-title>
<source>Information Research</source>
<year>2004</year>
<volume>9</volume>
<ext-link ext-link-type="uri" xlink:href="http://informationr.net/ir/9-4/paper192.html">http://informationr.net/ir/9-4/paper192.html</ext-link>
</mixed-citation>
</ref>
<ref id="B47">
<mixed-citation publication-type="other">
<article-title>Exhibit: Publishing Framework for Data-Rich Interactive Web Pages</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.simile-widgets.org/exhibit/">http://www.simile-widgets.org/exhibit/</ext-link>
</mixed-citation>
</ref>
<ref id="B48">
<mixed-citation publication-type="other">
<article-title>WorldCat.org: The World's Largest Library Catalog</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.worldcat.org/">http://www.worldcat.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B49">
<mixed-citation publication-type="other">
<article-title>Universal Biological Indexer and Organizer (uBio)</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.ubio.org/">http://www.ubio.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B50">
<mixed-citation publication-type="other">
<name>
<surname>Lu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Kahle</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>JZ</given-names>
</name>
<name>
<surname>Giles</surname>
<given-names>CL</given-names>
</name>
<article-title>A metadata generation system for scanned scientific volumes</article-title>
<source>Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries</source>
<year>2008</year>
<fpage>167</fpage>
<lpage>179</lpage>
<pub-id pub-id-type="doi">10.1145/1378889.1378918</pub-id>
</mixed-citation>
</ref>
<ref id="B51">
<mixed-citation publication-type="journal">
<name>
<surname>Lawrence</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Giles</surname>
<given-names>CL</given-names>
</name>
<name>
<surname>Bollacker</surname>
<given-names>K</given-names>
</name>
<article-title>Digital libraries and autonomous citation indexing</article-title>
<source>IEEE COMPUTER</source>
<year>1999</year>
<volume>32</volume>
<issue>6</issue>
<fpage>67</fpage>
<lpage>71</lpage>
<pub-id pub-id-type="doi">10.1109/2.769447</pub-id>
</mixed-citation>
</ref>
<ref id="B52">
<mixed-citation publication-type="book">
<name>
<surname>Councill</surname>
<given-names>IG</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Zhuang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Debnath</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bolelli</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>WC</given-names>
</name>
<name>
<surname>Sivasubramaniam</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Giles</surname>
<given-names>CL</given-names>
</name>
<article-title>Learning metadata from the evidence in an on-line citation matching scheme</article-title>
<source>JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries</source>
<year>2006</year>
<publisher-name>New York, NY, USA: ACM</publisher-name>
<fpage>276</fpage>
<lpage>285</lpage>
<pub-id pub-id-type="doi">10.1145/1141753.1141817</pub-id>
</mixed-citation>
</ref>
<ref id="B53">
<mixed-citation publication-type="journal">
<name>
<surname>Pocock</surname>
<given-names>RI</given-names>
</name>
<article-title>On the Arachnida taken in the Transvaal and in Nyasaland by Mr W. L. Distant and Dr Percy Rendall</article-title>
<source>Ann Mag Nat Hist</source>
<year>1898</year>
<volume>1</volume>
<fpage>308</fpage>
<lpage>321</lpage>
<ext-link ext-link-type="uri" xlink:href="http://biostor.org/reference/52084">http://biostor.org/reference/52084</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
<affiliations>
<list>
<country>
<li>Royaume-Uni</li>
</country>
<region>
<li>Écosse</li>
</region>
<settlement>
<li>Glasgow</li>
</settlement>
<orgName>
<li>Université de Glasgow</li>
</orgName>
</list>
<tree>
<country name="Royaume-Uni">
<region name="Écosse">
<name sortKey="Page, Roderic Dm" sort="Page, Roderic Dm" uniqKey="Page R" first="Roderic Dm" last="Page">Roderic Dm Page</name>
</region>
</country>
</tree>
</affiliations>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Ncbi/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000102 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd -nk 000102 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Ncbi
   |étape=   Merge
   |type=    RBID
   |clé=     PMC:3129327
   |texte=   Extracting scientific articles from a large digital archive: BioStor and the Biodiversity Heritage Library
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/RBID.i   -Sk "pubmed:21605356" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024