OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels

Internal identifier: 000034 (Pmc/Checkpoint); previous: 000033; next: 000035

Authors: Robyn E. Drinkwater [United Kingdom]; Robert W. N. Cubey [United Kingdom]; Elspeth M. Haston [United Kingdom]

RBID : PMC:4086207

Abstract

At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text prior to being sorted based on Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the data from OCR aid the digitisation process, they completed a series of trials which compared the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinion of the digitisation staff to the different sorting options. In total 7,200 specimens were processed.

When compared to an unsorted, random set of specimens, those which were sorted based on data added from the OCR were quicker to digitise. Of the methods tested here, the most successful in terms of efficiency used a protocol which required entering data into a limited set of fields and where the records were filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, and so a familiarity with the Collector or Country is rapidly established.


DOI: 10.3897/phytokeys.38.7168
PubMed: 25009435
PubMed Central: 4086207


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels</title>
<author>
<name sortKey="Drinkwater, Robyn E" sort="Drinkwater, Robyn E" uniqKey="Drinkwater R" first="Robyn E." last="Drinkwater">Robyn E. Drinkwater</name>
<affiliation wicri:level="1">
<nlm:aff id="A1">Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR</wicri:regionArea>
<wicri:noRegion>EH3 5LR</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Cubey, Robert W N" sort="Cubey, Robert W N" uniqKey="Cubey R" first="Robert W. N." last="Cubey">Robert W. N. Cubey</name>
<affiliation wicri:level="1">
<nlm:aff id="A1">Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR</wicri:regionArea>
<wicri:noRegion>EH3 5LR</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Haston, Elspeth M" sort="Haston, Elspeth M" uniqKey="Haston E" first="Elspeth M." last="Haston">Elspeth M. Haston</name>
<affiliation wicri:level="1">
<nlm:aff id="A1">Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR</wicri:regionArea>
<wicri:noRegion>EH3 5LR</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25009435</idno>
<idno type="pmc">4086207</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4086207</idno>
<idno type="RBID">PMC:4086207</idno>
<idno type="doi">10.3897/phytokeys.38.7168</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000190</idno>
<idno type="wicri:Area/Pmc/Curation">000190</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000034</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels</title>
<author>
<name sortKey="Drinkwater, Robyn E" sort="Drinkwater, Robyn E" uniqKey="Drinkwater R" first="Robyn E." last="Drinkwater">Robyn E. Drinkwater</name>
<affiliation wicri:level="1">
<nlm:aff id="A1">Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR</wicri:regionArea>
<wicri:noRegion>EH3 5LR</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Cubey, Robert W N" sort="Cubey, Robert W N" uniqKey="Cubey R" first="Robert W. N." last="Cubey">Robert W. N. Cubey</name>
<affiliation wicri:level="1">
<nlm:aff id="A1">Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR</wicri:regionArea>
<wicri:noRegion>EH3 5LR</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Haston, Elspeth M" sort="Haston, Elspeth M" uniqKey="Haston E" first="Elspeth M." last="Haston">Elspeth M. Haston</name>
<affiliation wicri:level="1">
<nlm:aff id="A1">Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</nlm:aff>
<country xml:lang="fr">Royaume-Uni</country>
<wicri:regionArea>Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR</wicri:regionArea>
<wicri:noRegion>EH3 5LR</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PhytoKeys</title>
<idno type="ISSN">1314-2011</idno>
<idno type="eISSN">1314-2003</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<label>Abstract</label>
<p>At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text prior to being sorted based on Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the data from OCR aid the digitisation process, they completed a series of trials which compared the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinion of the digitisation staff to the different sorting options. In total 7,200 specimens were processed.</p>
<p>When compared to an unsorted, random set of specimens, those which were sorted based on data added from the OCR were quicker to digitise. Of the methods tested here, the most successful in terms of efficiency used a protocol which required entering data into a limited set of fields and where the records were filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, and so a familiarity with the Collector or Country is rapidly established.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barber, A" uniqKey="Barber A">A Barber</name>
</author>
<author>
<name sortKey="Lafferty, D" uniqKey="Lafferty D">D Lafferty</name>
</author>
<author>
<name sortKey="Landrum, Lr" uniqKey="Landrum L">LR Landrum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beaman, Rs" uniqKey="Beaman R">RS Beaman</name>
</author>
<author>
<name sortKey="Cellinese, N" uniqKey="Cellinese N">N Cellinese</name>
</author>
<author>
<name sortKey="Heidorn, Pb" uniqKey="Heidorn P">PB Heidorn</name>
</author>
<author>
<name sortKey="Guo, Y" uniqKey="Guo Y">Y Guo</name>
</author>
<author>
<name sortKey="Green, Am" uniqKey="Green A">AM Green</name>
</author>
<author>
<name sortKey="Thiers, B" uniqKey="Thiers B">B Thiers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bebber, Dp" uniqKey="Bebber D">DP Bebber</name>
</author>
<author>
<name sortKey="Carine, Ma" uniqKey="Carine M">MA Carine</name>
</author>
<author>
<name sortKey="Wood, Jri" uniqKey="Wood J">JRI Wood</name>
</author>
<author>
<name sortKey="Wortley, Ah" uniqKey="Wortley A">AH Wortley</name>
</author>
<author>
<name sortKey="Harris, Dj" uniqKey="Harris D">DJ Harris</name>
</author>
<author>
<name sortKey="Prance, Gt" uniqKey="Prance G">GT Prance</name>
</author>
<author>
<name sortKey="Davidse, G" uniqKey="Davidse G">G Davidse</name>
</author>
<author>
<name sortKey="Paige, J" uniqKey="Paige J">J Paige</name>
</author>
<author>
<name sortKey="Pennington, Td" uniqKey="Pennington T">TD Pennington</name>
</author>
<author>
<name sortKey="Robson, Nkb" uniqKey="Robson N">NKB Robson</name>
</author>
<author>
<name sortKey="Scotland, Rw" uniqKey="Scotland R">RW Scotland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Berendsohn, Wg" uniqKey="Berendsohn W">WG Berendsohn</name>
</author>
<author>
<name sortKey="Chavan, V" uniqKey="Chavan V">V Chavan</name>
</author>
<author>
<name sortKey="Macklin, Ja" uniqKey="Macklin J">JA Macklin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Davis, Ph" uniqKey="Davis P">PH Davis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elith, J" uniqKey="Elith J">J Elith</name>
</author>
<author>
<name sortKey="Graham, Ch" uniqKey="Graham C">CH Graham</name>
</author>
<author>
<name sortKey="Anderson, Rp" uniqKey="Anderson R">RP Anderson</name>
</author>
<author>
<name sortKey="Dudik, M" uniqKey="Dudik M">M Dudik</name>
</author>
<author>
<name sortKey="Ferrier, S" uniqKey="Ferrier S">S Ferrier</name>
</author>
<author>
<name sortKey="Guisan, A" uniqKey="Guisan A">A Guisan</name>
</author>
<author>
<name sortKey="Hijmans, Rj" uniqKey="Hijmans R">RJ Hijmans</name>
</author>
<author>
<name sortKey="Huettmann, F" uniqKey="Huettmann F">F Huettmann</name>
</author>
<author>
<name sortKey="Leathwick, Jr" uniqKey="Leathwick J">JR Leathwick</name>
</author>
<author>
<name sortKey="Lehmann, A" uniqKey="Lehmann A">A Lehmann</name>
</author>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Lohmann, Lg" uniqKey="Lohmann L">LG Lohmann</name>
</author>
<author>
<name sortKey="Loiselle, Ba" uniqKey="Loiselle B">BA Loiselle</name>
</author>
<author>
<name sortKey="Manion, G" uniqKey="Manion G">G Manion</name>
</author>
<author>
<name sortKey="Moritz, C" uniqKey="Moritz C">C Moritz</name>
</author>
<author>
<name sortKey="Nakamura, M" uniqKey="Nakamura M">M Nakamura</name>
</author>
<author>
<name sortKey="Nakazawa, Y" uniqKey="Nakazawa Y">Y Nakazawa</name>
</author>
<author>
<name sortKey="Overton, Jmcc" uniqKey="Overton J">JMcC Overton</name>
</author>
<author>
<name sortKey="Peterson, At" uniqKey="Peterson A">AT Peterson</name>
</author>
<author>
<name sortKey="Phillips, Sj" uniqKey="Phillips S">SJ Phillips</name>
</author>
<author>
<name sortKey="Richardson, K" uniqKey="Richardson K">K Richardson</name>
</author>
<author>
<name sortKey="Scachetti Pereire, R" uniqKey="Scachetti Pereire R">R Scachetti-Pereire</name>
</author>
<author>
<name sortKey="Schapire, Re" uniqKey="Schapire R">RE Schapire</name>
</author>
<author>
<name sortKey="Sober N, J" uniqKey="Sober N J">J Soberón</name>
</author>
<author>
<name sortKey="Williams, S" uniqKey="Williams S">S Williams</name>
</author>
<author>
<name sortKey="Wisz, Ms" uniqKey="Wisz M">MS Wisz</name>
</author>
<author>
<name sortKey="Zimmerman, Ne" uniqKey="Zimmerman N">NE Zimmerman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hardisty, A" uniqKey="Hardisty A">A Hardisty</name>
</author>
<author>
<name sortKey="Roberts, D" uniqKey="Roberts D">D Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haston, E" uniqKey="Haston E">E Haston</name>
</author>
<author>
<name sortKey="Cubey, R" uniqKey="Cubey R">R Cubey</name>
</author>
<author>
<name sortKey="Harris, Dj" uniqKey="Harris D">DJ Harris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haston, E" uniqKey="Haston E">E Haston</name>
</author>
<author>
<name sortKey="Cubey, R" uniqKey="Cubey R">R Cubey</name>
</author>
<author>
<name sortKey="Pullan, M" uniqKey="Pullan M">M Pullan</name>
</author>
<author>
<name sortKey="Atkins, H" uniqKey="Atkins H">H Atkins</name>
</author>
<author>
<name sortKey="Harris, D" uniqKey="Harris D">D Harris</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heidorn, Pb" uniqKey="Heidorn P">PB Heidorn</name>
</author>
<author>
<name sortKey="Wei, Q" uniqKey="Wei Q">Q Wei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hyam, R" uniqKey="Hyam R">R Hyam</name>
</author>
<author>
<name sortKey="Drinkwater, Re" uniqKey="Drinkwater R">RE Drinkwater</name>
</author>
<author>
<name sortKey="Harris, Dj" uniqKey="Harris D">DJ Harris</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lafferty, D" uniqKey="Lafferty D">D Lafferty</name>
</author>
<author>
<name sortKey="Landrum, Lr" uniqKey="Landrum L">LR Landrum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lavoie, C" uniqKey="Lavoie C">C Lavoie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lees, Dc" uniqKey="Lees D">DC Lees</name>
</author>
<author>
<name sortKey="Lack, Hw" uniqKey="Lack H">HW Lack</name>
</author>
<author>
<name sortKey="Rougerie, R" uniqKey="Rougerie R">R Rougerie</name>
</author>
<author>
<name sortKey="Hernandez Lopez, A" uniqKey="Hernandez Lopez A">A Hernandez-Lopez</name>
</author>
<author>
<name sortKey="Raus, T" uniqKey="Raus T">T Raus</name>
</author>
<author>
<name sortKey="Avtzis, Nd" uniqKey="Avtzis N">ND Avtzis</name>
</author>
<author>
<name sortKey="Augustin, S" uniqKey="Augustin S">S Augustin</name>
</author>
<author>
<name sortKey="Lopez Vaamonde, C" uniqKey="Lopez Vaamonde C">C Lopez-Vaamonde</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Ag" uniqKey="Miller A">AG Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Moen, We" uniqKey="Moen W">WE Moen</name>
</author>
<author>
<name sortKey="Huang, J" uniqKey="Huang J">J Huang</name>
</author>
<author>
<name sortKey="Mccotter, M" uniqKey="Mccotter M">M McCotter</name>
</author>
<author>
<name sortKey="Neill, A" uniqKey="Neill A">A Neill</name>
</author>
<author>
<name sortKey="Best, J" uniqKey="Best J">J Best</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nelson, G" uniqKey="Nelson G">G Nelson</name>
</author>
<author>
<name sortKey="Paul, D" uniqKey="Paul D">D Paul</name>
</author>
<author>
<name sortKey="Riccardi, G" uniqKey="Riccardi G">G Riccardi</name>
</author>
<author>
<name sortKey="Mast, Ar" uniqKey="Mast A">AR Mast</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Purves, D" uniqKey="Purves D">D Purves</name>
</author>
<author>
<name sortKey="Scharlemann, Jpw" uniqKey="Scharlemann J">JPW Scharlemann</name>
</author>
<author>
<name sortKey="Harfoot, M" uniqKey="Harfoot M">M Harfoot</name>
</author>
<author>
<name sortKey="Newbold, T" uniqKey="Newbold T">T Newbold</name>
</author>
<author>
<name sortKey="Tittensor, Dp" uniqKey="Tittensor D">DP Tittensor</name>
</author>
<author>
<name sortKey="Hutton, J" uniqKey="Hutton J">J Hutton</name>
</author>
<author>
<name sortKey="Emmott, S" uniqKey="Emmott S">S Emmott</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tulig, M" uniqKey="Tulig M">M Tulig</name>
</author>
<author>
<name sortKey="Tarnowsky, N" uniqKey="Tarnowsky N">N Tarnowsky</name>
</author>
<author>
<name sortKey="Bevans, M" uniqKey="Bevans M">M Bevans</name>
</author>
<author>
<name sortKey="Kirchgessner, A" uniqKey="Kirchgessner A">A Kirchgessner</name>
</author>
<author>
<name sortKey="Thiers, B" uniqKey="Thiers B">B Thiers</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PhytoKeys</journal-id>
<journal-id journal-id-type="iso-abbrev">PhytoKeys</journal-id>
<journal-id journal-id-type="publisher-id">PhytoKeys</journal-id>
<journal-title-group>
<journal-title>PhytoKeys</journal-title>
</journal-title-group>
<issn pub-type="ppub">1314-2011</issn>
<issn pub-type="epub">1314-2003</issn>
<publisher>
<publisher-name>Pensoft Publishers</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25009435</article-id>
<article-id pub-id-type="pmc">4086207</article-id>
<article-id pub-id-type="doi">10.3897/phytokeys.38.7168</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Drinkwater</surname>
<given-names>Robyn E.</given-names>
</name>
<xref ref-type="aff" rid="A1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Cubey</surname>
<given-names>Robert W. N.</given-names>
</name>
<xref ref-type="aff" rid="A1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Haston</surname>
<given-names>Elspeth M.</given-names>
</name>
<xref ref-type="aff" rid="A1">1</xref>
</contrib>
</contrib-group>
<aff id="A1">
<label>1</label>
Royal Botanic Garden Edinburgh, 20a Inverleith Row, Edinburgh, EH3 5LR, UK</aff>
<author-notes>
<corresp>Corresponding author: Elspeth M. Haston (
<email xlink:type="simple">e.haston@rbge.org.uk</email>
)</corresp>
<fn fn-type="edited-by">
<p>Academic editor: S. Knapp</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<pub-date pub-type="epub">
<day>19</day>
<month>5</month>
<year>2014</year>
</pub-date>
<issue>38</issue>
<fpage>15</fpage>
<lpage>30</lpage>
<history>
<date date-type="received">
<day>31</day>
<month>1</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>4</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>Robyn E. Drinkwater, Robert W. N. Cubey, Elspeth M. Haston</copyright-statement>
<license license-type="creative-commons-attribution" xlink:href="http://creativecommons.org/licenses/by/4.0">
<license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<abstract>
<label>Abstract</label>
<p>At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text prior to being sorted based on Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the data from OCR aid the digitisation process, they completed a series of trials which compared the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinion of the digitisation staff to the different sorting options. In total 7,200 specimens were processed.</p>
<p>When compared to an unsorted, random set of specimens, those which were sorted based on data added from the OCR were quicker to digitise. Of the methods tested here, the most successful in terms of efficiency used a protocol which required entering data into a limited set of fields and where the records were filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, and so a familiarity with the Collector or Country is rapidly established.</p>
</abstract>
<kwd-group>
<label>Keywords</label>
<kwd>OCR</kwd>
<kwd>Digitisation</kwd>
<kwd>Data entry</kwd>
<kwd>Specimen</kwd>
<kwd>Label</kwd>
<kwd>Herbarium</kwd>
</kwd-group>
</article-meta>
<notes>
<sec sec-type="Citation">
<title>Citation</title>
<p>Drinkwater RE, Cubey RWN, Haston EM (2014) The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels. PhytoKeys 38: 15–30. doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.3897/phytokeys.38.7168">10.3897/phytokeys.38.7168</ext-link>
</p>
</sec>
</notes>
</front>
<body>
<sec sec-type="Introduction">
<title>Introduction</title>
<p>There is an increasingly urgent need to document and make available the specimens held in herbaria and other natural history collections, particularly with the current biodiversity crisis (
<xref rid="B5" ref-type="bibr">Berendsohn et al. 2010</xref>
,
<xref rid="B9" ref-type="bibr">Hardisty et al. 2013</xref>
,
<xref rid="B21" ref-type="bibr">Purves et al. 2013</xref>
). The digitisation of the collections makes the data accessible for a wide range of taxonomic and ecological research being carried out around the world (e.g.
<xref rid="B8" ref-type="bibr">Elith et al. 2006</xref>
,
<xref rid="B4" ref-type="bibr">Bebber et al. 2010</xref>
,
<xref rid="B17" ref-type="bibr">Lees et al. 2011</xref>
,
<xref rid="B16" ref-type="bibr">Lavoie 2013</xref>
). The size of the collections held in major herbaria means that complete digitisation of the specimens they hold is often unfeasible, especially given current reductions in funding.</p>
<p>At the Royal Botanic Garden Edinburgh (RBGE), a large-scale project to digitise the collections has been running in which specimens are minimally databased (
<xref rid="B10" ref-type="bibr">Haston et al. 2012a</xref>
). Minimal data include the filing name and geographical region, as well as a barcode that acts as a unique identifier. High-resolution, zoomable images of these specimens are made available through the online Herbarium Catalogue, accessed through the RBGE website (
<ext-link ext-link-type="uri" xlink:href="http://www.rbge.org.uk">www.rbge.org.uk</ext-link>
). They are also accessible via other online resources including Europeana (
<ext-link ext-link-type="uri" xlink:href="http://www.europeana.eu/">www.europeana.eu/</ext-link>
) and GenBank (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/genbank/">www.ncbi.nlm.nih.gov/genbank/</ext-link>
) using a stable URI system (
<xref rid="B13" ref-type="bibr">Hyam et al. 2012</xref>
). Whilst additional label data are not initially captured, they can be read by examining the specimen image online. There are approximately 3 million specimens in the herbarium at RBGE; of these, 630,000 have been databased, 30% of them with only minimal data attached.</p>
<p>A similar approach is being used at the New York Botanical Garden Herbarium, which holds an estimated 7.3 million specimens and has been databasing and imaging its collection for 17 years (
<xref rid="B23" ref-type="bibr">Tulig et al. 2012</xref>
). Based on the work already completed, they recently estimated that it would take a further 600,000 hours to completely database and image the remaining 6 million or so specimens. They have introduced new protocols for partially databasing specimens, increasing the speed of processing from an average of 10 per hour to 125 per hour.</p>
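<p>The scale of these figures can be checked with a little arithmetic. The sketch below uses only the numbers quoted above and assumes, purely for illustration, that each hourly rate stays constant and applies to the whole remaining backlog; the variable and function names are hypothetical.</p>
<preformat>
```python
# Figures quoted in the text above; names are illustrative only.
remaining_specimens = 6_000_000
full_rate_per_hour = 10       # average rate quoted before the new protocols
partial_rate_per_hour = 125   # rate quoted for partial databasing


def hours_needed(specimens, rate_per_hour):
    """Staff hours to process `specimens` at a constant hourly rate."""
    return specimens / rate_per_hour


full_hours = hours_needed(remaining_specimens, full_rate_per_hour)
partial_hours = hours_needed(remaining_specimens, partial_rate_per_hour)
print(full_hours, partial_hours)
```
</preformat>
<p>Under these assumptions the lower rate reproduces the 600,000-hour estimate, while the partial protocol would need roughly 48,000 hours for the same number of specimens.</p>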
<p>Whilst further information can be found through looking at images, data that are useful for biodiversity studies and other research are not easily available, and cannot be extracted from the database for use. These data can include location, habitat and a description of the plant. The next step in the process of databasing specimens must be to find ways of creating more complete and useful records, whilst minimising the need for a large investment in staff hours.</p>
<p>It is only recently that Optical Character Recognition (OCR) has started to be used more widely to aid with the digitisation of natural history collections (
<xref rid="B19" ref-type="bibr">Moen et al. 2008</xref>
,
<xref rid="B12" ref-type="bibr">Heidorn and Wei 2008</xref>
,
<xref rid="B20" ref-type="bibr">Nelson et al. 2012</xref>
) and literature relating to these collections such as the Biodiversity Heritage Library (
<xref rid="B6" ref-type="bibr">Biodiversity Heritage Library 2014</xref>
) which uses OCR output to help navigate the literature. As the quality of the software has improved, OCR has become a viable option, more able to cope with the complex tasks which can be presented by natural history objects; e.g. distinguishing between labels and plant material on a herbarium specimen. Another contributing factor to the
increased viability of OCR could be that there is now a large enough body of imaged specimens to make investment in OCR software worthwhile.</p>
<p>Several software applications have been developed to make OCR output easier to use. SALIX (
<xref rid="B15" ref-type="bibr">Lafferty and Landrum 2008</xref>
,
<xref rid="B2" ref-type="bibr">Barber et al. 2013</xref>
) and HERBIS (
<xref rid="B3" ref-type="bibr">Beaman et al. 2006</xref>
) parse the OCR output to a database, in a semi-automatic way, with the process being watched and facilitated by a user. Another approach (
<xref rid="B12" ref-type="bibr">Heidorn and Wei 2008</xref>
) has been to mark up the OCR output for input into Darwin Core.
<xref rid="B22" ref-type="bibr">Silver Biology (2013)</xref>
is currently testing a site for enabling a citizen science initiative to database herbarium specimen labels. The OCR output is tagged with the relevant fields (e.g. Collector) and then parsed into Darwin Core fields. The use of OCR is also being explored by the AugmentOCR working group as part of Integrated Digitized Biocollections (iDigBio), the National Resource for Advancing Digitization of Biodiversity Collections (ADBC) funded by the National Science Foundation.</p>
<p>At RBGE we have been exploring how OCR processing can be used to add data to the minimal entries already created for specimens.</p>
<p>Whilst we have only just started to make use of data from OCR, the process of gathering this information has been integrated into the digitisation workflows since 2010. The workflows at RBGE have been developed in such a way that they are ‘modular’ (
<xref rid="B11" ref-type="bibr">Haston et al. 2012b</xref>
), to allow flexibility in the stages of digitising specimens. All specimen images are passed through ABBYY Recognition Server (
<xref rid="B1" ref-type="bibr">Abbyy 2014</xref>
) which provides the OCR output in the form of a text file. The unparsed text is automatically entered into a single field within a MySQL database. A PDF file with the OCR output overlaid on the image of the specimen is also saved.</p>
<p>The aim of this investigation is to examine how the OCR output can be incorporated into the workflows to make the digitisation process more efficient.</p>
<p>In particular we hope to be able to answer the following questions:</p>
<p>1. Can OCR speed up the digitisation process, whilst maintaining data quality?</p>
<p>2. Is OCR worth the investment in time and software?</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<p>To investigate how data extracted by the OCR process can aid the addition of data to minimal database records, a series of trials was carried out by six members of the digitisation team at RBGE.</p>
<p>The specimens used in this study were collected in Southwest Asia and the Middle East from the early 19
<sup>th</sup>
century to the present day. The earlier specimens are generally handwritten, but some have printed headings (
<xref ref-type="fig" rid="F1">Figure 1a</xref>
). Later specimens are generally type-written or printed (
<xref ref-type="fig" rid="F1">Figure 1b</xref>
and
<xref ref-type="fig" rid="F1">1c</xref>
). Specimens include those used in the writing of the Flora of Turkey (
<xref rid="B7" ref-type="bibr">Davis 1985</xref>
), and also the ongoing work on the Flora of Arabia (
<xref rid="B18" ref-type="bibr">Miller 1996</xref>
). This is a key focus region for research at RBGE, and several members of staff have considerable experience of collections from this area. They are a valuable resource, able to advise on difficult handwriting, cryptic notes on labels and terms to use when searching OCR text.</p>
<fig id="F1" orientation="portrait" position="float">
<label>Figure 1.</label>
<caption>
<p>Example labels:
<bold>a</bold>
Pre-printed label with handwritten details
<bold>b</bold>
and
<bold>c</bold>
mixed labels with pre-printed and typed information
<bold>d</bold>
Mainly handwritten label, with printer’s mark
<bold>e</bold>
Mainly handwritten label with unusual phrasing.</p>
</caption>
<graphic xlink:href="phytokeys-038-015-g001"></graphic>
</fig>
<p>These specimens have been imaged and minimally databased as part of an ongoing project to image and digitise RBGE herbarium specimens. The digitisation workflow includes the routine processing of all specimen images through ABBYY Recognition Server software, and the unparsed text output is stored within the images database.</p>
<p>For this study 20,000 specimen records were exported from the main database into a temporary Access database. The data included the minimal data fields, the image file location and the OCR data. The OCR output was searched for Countries and Collector names, which were considered to be the most useful additional fields, as well as the most likely to be easily ‘read’ by the OCR software. A short SQL script in Access was used to search for a selected word within the OCR text and, when present, to copy the word to a new field. Because the specimens came from a limited geographical area, a list of Countries and major Collectors could be developed.</p>
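<p>The search-and-copy step can be illustrated with a short sketch. This is not the Access SQL script used in the study, which is not reproduced here; it is a Python stand-in, and the country list, field names and barcode value are hypothetical.</p>
<preformat>
```python
import re

# Hypothetical search list; the study developed its own list for the region.
COUNTRIES = ["Turkey", "Iran", "Iraq", "Yemen", "Oman"]


def find_first_keyword(ocr_text, keywords):
    """Return the first keyword found as a whole word in the OCR text, else None."""
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, ocr_text, flags=re.IGNORECASE):
            return keyword
    return None


# A made-up record: unparsed OCR text held in a single field.
record = {"barcode": "X0000001", "ocr_text": "FLORA OF TURKEY\nProv. Ankara: 900 m"}

# Copy the matched word into a new field, leaving the OCR text untouched.
record["country_from_ocr"] = find_first_keyword(record["ocr_text"], COUNTRIES)
print(record["country_from_ocr"])
```
</preformat>
<p>Matching on whole words avoids false hits inside longer words, and keeping the result in a separate field means the value can still be verified against the image before it is accepted.</p>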
<p>In addition to the simple searches for Country and Collector, other keywords and phrases were found to be peculiar to a particular Collector or Country. These included printers’ marks (
<xref ref-type="fig" rid="F1">Figure 1d</xref>
) on otherwise handwritten labels, unusual wording (
<xref ref-type="fig" rid="F1">Figure 1e</xref>
) or abbreviations used within pre-printed label headings. Common ‘reading’ errors made by the OCR software (e.g. reading Turbey instead of Turkey) and variations in the spelling of provinces, states or cities were also useful in attaching an initial Country or Collector to a specimen.</p>
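<p>One way to exploit such misreadings and spelling variants is a lookup from each variant string to the Country it indicates. The sketch below is an assumption about how this could be done, not the procedure used in the study. Only the ‘Turbey’ misreading comes from the text; the other entries and the function name are illustrative.</p>

```python
# Hypothetical lookup: each key is a string that may appear in OCR text,
# each value is the Country it is taken to indicate.
VARIANT_TO_COUNTRY = {
    "turkey": "Turkey",
    "turbey": "Turkey",    # OCR misreading reported in the text
    "anatolia": "Turkey",  # illustrative regional name
    "iran": "Iran",        # illustrative second country
}


def infer_country(ocr_text):
    """Return a Country only when every variant found points to the same one."""
    lowered = ocr_text.lower()
    hits = {
        country
        for variant, country in VARIANT_TO_COUNTRY.items()
        if variant in lowered  # plain substring match, so short keys can over-match
    }
    return hits.pop() if len(hits) == 1 else None
```

<p>Returning nothing when the variants disagree keeps ambiguous labels out of the sorted batches, which matters because the inferred values were verified by eye before being written back to the records.</p>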
<p>The specimen records were then sorted by either Country or Collector, to enable verification of the data. This could be done rapidly using
<xref rid="B14" ref-type="bibr">IrfanView (2014)</xref>
, a freeware graphic viewer, which used the image file locations to create a slideshow of specimen images. Each specimen could then be checked quickly and the Collector and/or Country verified.</p>
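<p>The ordering behind such a slideshow can be sketched as below: image paths are sorted so that specimens sharing a Country and Collector are viewed consecutively, then written one per line to a text file. The field names are hypothetical, and whether a given viewer accepts a list file of this kind is an assumption to check against its documentation.</p>

```python
def slideshow_order(records, sort_keys=("country", "collector")):
    """Return image paths ordered so that similar specimens are viewed together."""
    ordered = sorted(
        records,
        key=lambda rec: tuple(rec.get(key) or "" for key in sort_keys),
    )
    return [rec["image_path"] for rec in ordered]


def write_slideshow_list(records, out_path):
    """Write the ordered image paths to a plain-text file, one path per line."""
    paths = slideshow_order(records)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(paths) + "\n")
    return paths
```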
<p>Once the Collector and Country had been verified, these data were added to the original specimen records using a batch process facility. This allowed a large number of records to be rapidly updated.</p>
<sec sec-type="Trial format">
<title>Trial format</title>
<p>The updated records were then used as the basis of a series of trials to assess how the data extracted from the OCR could be utilised in the wider digitisation effort at RBGE. The trials were set up to compare rates of data entry with and without OCR data being used to aid the process.</p>
<p>The digitisation staff used the institutional database for data entry, allowing full use of the look-up tables for collectors, countries and their top-level divisions, as well as a short-cut for repeat entry of fields. They were provided with two screens, one landscape and one portrait, to allow easy viewing of specimen images and organisation of the other programmes required.</p>
<p>Each trial consisted of two protocols, which differed in the amount of data being captured. The Complete Protocol involved the capture of all data on the specimen, including the original label as well as any additional determinations and annotations. The Partial Protocol limited the capture of data to a pre-determined standard set of fields including collector, collection number and date, locality information, and the taxon name under which it was originally collected. For each digitiser, twenty-four batches of records (six trials × two protocols × two repeats; <xref ref-type="table" rid="T1">Table 1</xref>), each comprising 50 specimens, were created using a series of filters. These batches were then given to the team of digitisers.</p>
<p>
<bold>The six ‘filters’ used were:</bold>
</p>
<p>1. Pre-study control (Random)</p>
<p>2. Collector only</p>
<p>3. Country only</p>
<p>4. Collector and Country</p>
<p>5. Collector and Country, with full OCR output</p>
<p>6. Post-study control (Random)</p>
<p>Trial 1: Pre-study control</p>
<p>This first trial was used as a control and provided a baseline for the testing. The digitisers were each given two batches of randomly selected specimens which only contained minimal data.</p>
<p>Trial 2: Collector only</p>
<p>The digitisers were each given two batches of specimens which had been selected using a filter which ensured that all specimens in the batch had been collected by the same collector or collector group.</p>
<p>Trial 3: Country only</p>
<p>The digitisers were each given two batches of specimens which had been selected using a filter which ensured that all specimens in the batch had been collected in the same country.</p>
<p>Trial 4: Collector and Country</p>
<p>The digitisers were each given two batches of specimens which had been selected using a filter which ensured that all specimens in the batch had been collected in the same country and by the same collector or collector group.</p>
<p>Trial 5: Collector and Country, with full OCR output</p>
<p>The digitisers were each given two batches of specimens which had been selected using a filter which ensured that all specimens in the batch had been collected in the same country and by the same collector or collector group. In addition, a full OCR output was provided. For this study, the OCR output took the form of a PDF in which the recognised text was layered over the specimen image wherever text was detected. The digitisers were then asked to copy the OCR data into the appropriate fields and correct it for spelling and punctuation errors.</p>
<p>Trial 6: Post-study control</p>
<p>This final trial was used as a second control to assess how the use of the other methods and increased familiarity with the process affected timings. The digitisers were each given two batches of randomly selected specimens which only contained minimal data.</p>
<p>The digitisers were asked to keep a record of the time it took to complete each set of specimens, excluding breaks.</p>
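<p>The batch selection described for the six trials can be sketched as follows. This is an illustrative reconstruction, not the filters used in the institutional database. The function and field names are hypothetical, and only the batch size of 50 is taken from the text.</p>

```python
import random
from collections import defaultdict

BATCH_SIZE = 50  # batch size stated in the text


def build_batch(records, filter_fields=(), batch_size=BATCH_SIZE, rng=None):
    """Pick one batch of records that share the same values for filter_fields.

    With no filter_fields the batch is a simple random sample, as in the
    two control trials.
    """
    rng = rng or random.Random()
    if not filter_fields:
        return rng.sample(records, batch_size)
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in filter_fields)
        if all(key):  # skip records where a filter field is still empty
            groups[key].append(rec)
    eligible = [group for group in groups.values() if len(group) >= batch_size]
    if not eligible:
        raise ValueError("no group is large enough for a full batch")
    return rng.sample(rng.choice(eligible), batch_size)
```

<p>Calling <italic>build_batch(records, ("collector",))</italic> corresponds to Trial 2, <italic>("country",)</italic> to Trial 3 and <italic>("collector", "country")</italic> to Trials 4 and 5, under the field names assumed here.</p>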
</sec>
<sec sec-type="Analysis">
<title>Analysis</title>
<p>The results of the tests were collated and an Analysis of Variance (ANOVA) was carried out in R. The digitisers were also asked to complete a short survey which explored the ‘people’ side of the work, asking about preferred workflows, their perception of time taken to complete tests and what resources may be of benefit to aid similar work in the future. The online questionnaire was followed up with an informal discussion of the trials, allowing points mentioned in the survey to be discussed further and also to discuss some of the wider implications of digitising specimens.</p>
</sec>
</sec>
<sec sec-type="Results">
<title>Results</title>
<p>The results of the study show significant differences in the average time taken for the trials to be completed. The level of variation observed between the trials differed between the Complete and Partial Protocols. Significant variation was observed between the trials completed using the Partial Protocol.</p>
<p>The results are summarised in
<xref ref-type="table" rid="T2">Table 2</xref>
.</p>
<table-wrap id="T1" orientation="portrait" position="float">
<label>Table 1.</label>
<caption>
<p>Format of the trials.</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<th rowspan="1" colspan="1">Trial</th>
<th rowspan="1" colspan="1">‘Filter’</th>
<th rowspan="1" colspan="1">Protocol</th>
<th rowspan="1" colspan="1">Number of repeats/person</th>
<th rowspan="1" colspan="1">Total specimens/person</th>
</tr>
<tr>
<td rowspan="2" colspan="1">1.</td>
<td rowspan="2" colspan="1">Random</td>
<td rowspan="1" colspan="1">Complete</td>
<td rowspan="2" colspan="1">2</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Partial</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="2" colspan="1">2.</td>
<td rowspan="2" colspan="1">Collector</td>
<td rowspan="1" colspan="1">Complete</td>
<td rowspan="2" colspan="1">2</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Partial</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="2" colspan="1">3.</td>
<td rowspan="2" colspan="1">Country</td>
<td rowspan="1" colspan="1">Complete</td>
<td rowspan="2" colspan="1">2</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Partial</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="2" colspan="1">4.</td>
<td rowspan="2" colspan="1">Collector & Country</td>
<td rowspan="1" colspan="1">Complete</td>
<td rowspan="2" colspan="1">2</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Partial</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="2" colspan="1">5.</td>
<td rowspan="2" colspan="1">Collector & Country (OCR)</td>
<td rowspan="1" colspan="1">Complete</td>
<td rowspan="2" colspan="1">2</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Partial</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="2" colspan="1">6.</td>
<td rowspan="2" colspan="1">Random</td>
<td rowspan="1" colspan="1">Complete</td>
<td rowspan="2" colspan="1">2</td>
<td rowspan="1" colspan="1">100</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Partial</td>
<td rowspan="1" colspan="1">100</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" orientation="portrait" position="float">
<label>Table 2.</label>
<caption>
<p>Average time taken to complete trials.</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<th rowspan="1" colspan="1">Trial</th>
<th rowspan="1" colspan="1">‘Filter’</th>
<th rowspan="1" colspan="1">Number of completed batches per Protocol</th>
<th rowspan="1" colspan="1">Average Complete Protocol (minutes)</th>
<th rowspan="1" colspan="1">% time saved (compared with Random 1)</th>
<th rowspan="1" colspan="1">Average Partial Protocol (minutes)</th>
<th rowspan="1" colspan="1">% time saved (compared with Random 1)</th>
</tr>
<tr>
<td rowspan="1" colspan="1">1.</td>
<td rowspan="1" colspan="1">Random 1</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">313</td>
<td rowspan="1" colspan="1">0%</td>
<td rowspan="1" colspan="1">226.9</td>
<td rowspan="1" colspan="1">0%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2.</td>
<td rowspan="1" colspan="1">Collector</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">259.5</td>
<td rowspan="1" colspan="1">17.1%</td>
<td rowspan="1" colspan="1">220.2</td>
<td rowspan="1" colspan="1">2.7%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3.</td>
<td rowspan="1" colspan="1">Country</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">345.7</td>
<td rowspan="1" colspan="1">10.5% increase</td>
<td rowspan="1" colspan="1">192.6</td>
<td rowspan="1" colspan="1">15.2%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4.</td>
<td rowspan="1" colspan="1">Collector & Country</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">262.8</td>
<td rowspan="1" colspan="1">16.1%</td>
<td rowspan="1" colspan="1">105.3</td>
<td rowspan="1" colspan="1">53.6%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5.</td>
<td rowspan="1" colspan="1">Collector & Country (OCR)</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">252.6</td>
<td rowspan="1" colspan="1">19.3%</td>
<td rowspan="1" colspan="1">125.7</td>
<td rowspan="1" colspan="1">44.7%</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6.</td>
<td rowspan="1" colspan="1">Random 2</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">283.9</td>
<td rowspan="1" colspan="1">9.3%</td>
<td rowspan="1" colspan="1">219.9</td>
<td rowspan="1" colspan="1">3.1%</td>
</tr>
</tbody>
</table>
</table-wrap>
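<p>The ‘% time saved’ columns of Table 2 follow from the mean times as (baseline − mean) / baseline, with Random 1 as the baseline. The short sketch below recomputes them from the rounded means printed in the table, so the results may differ from the published percentages by a few tenths of a percentage point. The function names are illustrative.</p>

```python
# Mean minutes per batch of 50 specimens, copied from Table 2
# as (Complete Protocol, Partial Protocol).
MEAN_MINUTES = {
    "Random 1": (313.0, 226.9),
    "Collector": (259.5, 220.2),
    "Country": (345.7, 192.6),
    "Collector & Country": (262.8, 105.3),
    "Collector & Country (OCR)": (252.6, 125.7),
    "Random 2": (283.9, 219.9),
}


def percent_saved(baseline, value):
    """Percentage of time saved relative to baseline; negative means slower."""
    return 100.0 * (baseline - value) / baseline


def time_saved_table(means, baseline_key="Random 1"):
    """Return {filter: (complete %, partial %)} relative to the baseline trial."""
    base_complete, base_partial = means[baseline_key]
    return {
        name: (percent_saved(base_complete, c), percent_saved(base_partial, p))
        for name, (c, p) in means.items()
    }


if __name__ == "__main__":
    for name, (complete, partial) in time_saved_table(MEAN_MINUTES).items():
        print(f"{name:28s} {complete:6.1f}% {partial:6.1f}%")
```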
<p>Diagnostic plots were used to check whether the data were normally distributed. There was evidence of some heteroscedasticity in the data, so a normal distribution could not simply be assumed. A Poisson distribution model was therefore also tested and compared with a normal distribution model using the Akaike Information Criterion (AIC), which suggested that the normal distribution model fits the data better. We therefore present the results from the analyses based on a normal distribution.</p>
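<p>The AIC comparison can be illustrated with intercept-only models, for which the maximum-likelihood fits have closed forms. This is a simplified sketch in Python and not the R analysis carried out in the study, which is not reproduced in the paper. The function names are illustrative, and the Poisson model assumes times recorded as whole minutes.</p>

```python
import math


def aic_normal(values):
    """AIC of a normal model with mean and variance fitted by maximum likelihood."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # ML estimate, divides by n
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return 2 * 2 - 2 * loglik  # two parameters: mean and variance


def aic_poisson(values):
    """AIC of a Poisson model with its rate fitted by maximum likelihood."""
    n = len(values)
    lam = sum(values) / n  # the ML estimate of the rate is the sample mean
    loglik = sum(v * math.log(lam) - lam - math.lgamma(v + 1) for v in values)
    return 2 * 1 - 2 * loglik  # one parameter: the rate


def preferred_model(values):
    """Name the model with the lower AIC."""
    return "normal" if aic_normal(values) < aic_poisson(values) else "poisson"
```

<p>Batch times whose spread is much larger than the square root of their mean are overdispersed relative to a Poisson model, which is the situation in which the normal model attains the lower AIC.</p>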
<p>The data were analysed to investigate the effect of Person on the trials, since this would have an impact on the analysis used. Firstly, a linear regression was carried out treating each person as a factor. This suggested that the variation observed is explained by the Trial rather than the Person. Secondly, co-plots were used to visualise the interactions of the person and the trials. These showed no significant effect of the person on the results, and the major effects were related to Trial. As a result of these analyses, it was decided that one of the datasets should be excluded from the analyses as an outlier.</p>
<p>The Analysis of Variance (ANOVA) showed significant variation between the 12 trials (<xref ref-type="table" rid="T3">Table 3</xref>).</p>
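<p>For readers unfamiliar with the test, the F value of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it in plain Python. It is an illustration of the statistic only, not the R code used in the study, and it stops short of the p-value, which requires the F distribution.</p>

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

<p>Here each group would hold the batch completion times for one trial, so that twelve groups give the 11 degrees of freedom reported for Trial in Table 3.</p>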
<p>The filters appear to have a greater impact in the trials using the Partial Protocol, which is the standard for the majority of databasing at RBGE. These trials were therefore analysed further to explore this impact, and the results are illustrated in the box plots (
<xref ref-type="fig" rid="F2">Figures 2</xref>
and
<xref ref-type="fig" rid="F3">3</xref>
).</p>
<fig id="F2" orientation="portrait" position="float">
<label>Figure 2.</label>
<caption>
<p>Box plot of Complete and Partial Protocol results. R1C – Random 1 Complete; R1P – Random 1 Partial; CollC – Collector only Complete; CollP – Collector only Partial; CouC – Country only Complete; CouP – Country only Partial; CCC – Collector & Country Complete; CCP – Collector & Country Partial; OCRC – Collector & Country OCR Complete; OCRP – Collector & Country OCR Partial; R2C – Random 2 Complete; R2P – Random 2 Partial.</p>
</caption>
<graphic xlink:href="phytokeys-038-015-g002"></graphic>
</fig>
<fig id="F3" orientation="portrait" position="float">
<label>Figure 3.</label>
<caption>
<p>Box plot of Partial Protocol results.</p>
</caption>
<graphic xlink:href="phytokeys-038-015-g003"></graphic>
</fig>
<sec sec-type="Partial protocol">
<title>Partial protocol</title>
<p>The trials completed using the Partial Protocol show a significant reduction in the average time taken to add data to specimens which had been filtered by Country, by Collector and Country and by Collector and Country with OCR.</p>
<p>The greatest reduction in average time was seen in those specimens filtered by both Collector and Country. Of the two single filters, the Country filter appeared to have the greater impact on reducing the time.</p>
<p>The results of the ANOVA for the 6 trials are shown in
<xref ref-type="table" rid="T4">Table 4</xref>
. These were calculated using the Protocol ‘pairs’ (Complete and Partial). Three of the trials were found to have a result which was significant to greater than 0.001.</p>
<table-wrap id="T3" orientation="portrait" position="float">
<label>Table 3.</label>
<caption>
<p>Result of ANOVA for the 12 trials.</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Df</th>
<th rowspan="1" colspan="1">F Value</th>
<th rowspan="1" colspan="1">Pr (>F)</th>
<th rowspan="1" colspan="1">Significance</th>
</tr>
<tr>
<td rowspan="1" colspan="1">Trial</td>
<td rowspan="1" colspan="1">11</td>
<td rowspan="1" colspan="1">13.03</td>
<td rowspan="1" colspan="1">4.11e-14</td>
<td rowspan="1" colspan="1">*** (0)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Residuals</td>
<td rowspan="1" colspan="1">85</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4" orientation="portrait" position="float">
<label>Table 4.</label>
<caption>
<p>Result of ANOVA using Protocol ‘pairs’ (Complete and Partial).</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<th rowspan="1" colspan="1">Trial</th>
<th rowspan="1" colspan="1"></th>
<th rowspan="1" colspan="1">Df</th>
<th rowspan="1" colspan="1">F Value</th>
<th rowspan="1" colspan="1">Pr (>F)</th>
<th rowspan="1" colspan="1">Significance</th>
</tr>
<tr>
<td rowspan="2" colspan="1">Partial</td>
<td rowspan="1" colspan="1">Trial</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">6.487</td>
<td rowspan="1" colspan="1">0.0013</td>
<td rowspan="1" colspan="1">** (0.001)</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Residuals</td>
<td rowspan="1" colspan="1">18</td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec sec-type="Survey">
<title>Survey</title>
<p>The survey was completed by all those who took part in the trials. The first five questions asked the digitisers to assign a value of 1-5 to each of the tests, based on speed, ease of use, accuracy and preference.</p>
<p>Question 1: speed</p>
<p>The participants perceived the trials filtered by Country and Collector to be the fastest (66.7%) and the two random trials to be the slowest (66.7%). The use of OCR data to filter the specimens was perceived to be slightly faster than the Country only filter.</p>
<p>Question 2: ease of use</p>
<p>A similar result was found for the question asking the participants to rate the filters by their relative ease of use. Collector and Country was perceived to be the easiest to use (100%) and the hardest were the two random filters (66.7%).</p>
<p>Question 3: accuracy</p>
<p>Again the Collector and Country filter was perceived to be the least likely to lead to mistakes (83.3%) and the random filters were perceived to be the most likely to lead to mistakes (66.7% and 50%).</p>
<fig id="F4" orientation="portrait" position="float">
<label>Figure 4.</label>
<caption>
<p>Digitiser responses to questions 1, 2 and 3 of the survey.</p>
</caption>
<graphic xlink:href="phytokeys-038-015-g004"></graphic>
</fig>
<p>Questions 4 and 5: preference</p>
<p>The digitisers were asked which of the workflows they would prefer for digitising 50 and 1000 specimens. For 50 specimens there was a clear preference for the Collector and Country filter, with all participants selecting this filter. However, when considering larger numbers of specimens, the number selecting Collector and Country dropped, with two selecting the Country only filter.</p>
</sec>
</sec>
<sec sec-type="Discussion">
<title>Discussion</title>
<sec sec-type="Summary">
<title>Summary</title>
<p>This study investigated how data extracted from OCR can be used to sort specimens prior to databasing and aid in the addition of data to minimal database records. Of the methods tested here, the most successful in terms of efficiency used the Partial Protocol, filtered by Collector and Country. This method was on average about 20 minutes faster per batch of 50 records than the next most efficient method (53.6% against 44.7% time saved relative to the Random 1 baseline, a difference of 8.9 percentage points).</p>
</sec>
<sec sec-type="Protocols: Complete and Partial">
<title>Protocols: Complete and Partial</title>
<p>As expected, the Complete Protocol, which requires a larger quantity of data to be entered for each record, resulted in a significant increase in the time taken to enter data. In particular, entering multiple specimen determinations often involves creating additional name records not already held in the database, which can be time-consuming. The amount of data on a label to be entered is a balance between usefulness and cost. For most users, we believe that the Partial Protocol, which places more emphasis on the geographical data, captures the highest priority information from the label.</p>
</sec>
<sec sec-type="Filters: Collector and Country">
<title>Filters: Collector and Country</title>
<p>Prior to the trials, there had been an expectation that filtering the records by Collector would have the greatest impact. This was not borne out during the trials. In fact, the greatest impact came from filtering the records by Country. From the feedback it was apparent that a familiarity with the geography of a country aids the digitisation process more than familiarity with a Collector’s label style and handwriting. However, a combination of the Country filter with the Collector filter was found to be most effective in speeding up the data entry process.</p>
<p>This was also reflected in the feedback from the digitisation team, who all identified this combined filter as the preferred option for digitising 50 specimens, and the majority
would prefer this filter when digitising 1000 specimens. However, occasionally working with a large batch of similar records from a particular collector or country which were difficult in terms of legibility or geography resulted in reduced job satisfaction.</p>
</sec>
<sec sec-type="Variability">
<title>Variability</title>
<p>Whilst some of the trials showed a much greater variation in times to complete than others, the lack of variation between the preliminary random trial and the final random trial suggests that there was little ‘learning effect’.</p>
</sec>
<sec sec-type="Direct use of OCR data">
<title>Direct use of OCR data</title>
<p>The direct use of OCR output seems to have had very little effect on the time it took to digitise images. This may be due in part to the format of the output which did not allow users to copy multiple lines of text easily. More suitable output formats may increase the impact of the OCR output in the future.</p>
<p>The OCR output was most useful for long sections of text, often descriptions of the habitat and plant. However, some of the digitisers also found the output useful for shorter sections of texts, particularly place names.</p>
<p>In general, care needs to be taken in using the OCR output directly, as there can be some errors in punctuation, spelling and spacing. It is currently only of use for typed and printed labels, and is not yet able to pick up hand-written ones, and so was not available for all specimens encountered. In some cases the quality of the OCR output was so poor (spelling errors etc.) that it was quicker to type even the longer sections of text.</p>
</sec>
<sec sec-type="The Human factor">
<title>The Human factor</title>
<p>The results of the questionnaire and the subsequent discussion with the digitisers raised several interesting and unexpected points.</p>
</sec>
<sec sec-type="Preference for working with physical specimens">
<title>Preference for working with physical specimens</title>
<p>There was a clear preference expressed for working with physical specimens rather than with images of them. This point, raised during the discussion with the digitisers, was one the authors had not previously considered. Two main reasons emerged from the discussion. Firstly, they found that using a screen to view, read and interpret the label information can cause more strain on the eyes than looking at a physical specimen. Secondly, they felt that it took more time to manipulate the images of the specimens and access the label information. The software we had provided to the digitisers did not allow an easy zoom to the area of interest, whereas they felt that a physical specimen can be manipulated more easily and moved to make the label easier to read.</p>
</sec>
<sec sec-type="Working ‘methods’">
<title>Working ‘methods’</title>
<p>The digitisers also expressed the view that it was desirable for two people to work on similar sets of specimens since this gave them the opportunity to discuss and help each other. This was something which was not designed as part of these trials, but which came about because of the selection of specimen sets. This was more apparent for one set of specimens in which the handwriting on the labels was particularly difficult to read.</p>
<p>For the purpose of the trials we pre-filled some of the fields in the institutional database: Collector, Country or both, depending on the trial. The work carried out in the preparation of the batches which allowed the pre-filling of these fields meant that some issues, such as difficult handwriting of a collector’s name, did not have to be handled by the digitisers. This was seen as an advantage by the digitisation team.</p>
<p>In the questionnaire, we asked the digitisation team whether they thought any of the filters would lead to an increase or a reduction in mistakes in the data. Whilst this is something we have not quantified by checking the data entered during this investigation, it is interesting to note that the Collector and Country filter was felt to be least likely to lead to mistakes in the data.</p>
</sec>
<sec sec-type="Future work">
<title>Future work</title>
<p>This feedback from the digitisers has influenced how the next phase of the digitisation of the collection will develop. Where appropriate, the digitisers will work in pairs, enabling sharing of learning and expertise and allowing discussion of problems encountered. Further to this, the digitisers felt it would be beneficial to have a resource which provided examples of collectors’ handwriting and locations for old or difficult names. There is also a need to take into consideration the well-being of the digitisation staff, particularly with reference to the physical environment for repetitive tasks, something we will consider when developing the digitisation process in the future.</p>
<p>The use of OCR data will continue to be expanded for the digitisation of the collections in general. In particular this output is also likely to be of high quality for many of the more recent specimens, as they have clear type-written labels. For families like the
<named-content content-type="taxon-name">Zingiberaceae</named-content>
where the labels often have very long descriptions, partly because floral characters are lost once the specimen is pressed, access to the OCR output of the label would allow the full label to be easily added to the specimen record through a simple cut and paste. A future study of how working with physical versus virtual specimens affects workflows for the digitisation process may be carried out to help optimise practices at RBGE.</p>
<p>We are exploring other elements we could extract from the OCR output. These include numerical elements such as the Collection Number, Date, Latitude and Longitude, and Altitude. There is also potential to extract additional levels of locality information.</p>
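<p>As an indication of what such extraction might look like, the sketch below pulls an altitude and a date from OCR text with regular expressions. It is a hypothetical illustration only; the paper does not describe the approach being explored. The patterns, the assumed day-month-year order and the field names are all assumptions, and real labels would need many more formats.</p>

```python
import re

# Up to four digits followed by a unit, e.g. "1200 m" or "3500ft".
ALTITUDE_RE = re.compile(r"\b(\d{1,4})\s*(m|ft)\b", re.IGNORECASE)
# Numeric dates such as "14.6.1962" or "14/06/1962"; day-month-year is assumed.
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def extract_numeric_fields(ocr_text):
    """Return a dict holding any altitude and date found in the OCR text."""
    fields = {}
    altitude = ALTITUDE_RE.search(ocr_text)
    if altitude:
        fields["altitude"] = (int(altitude.group(1)), altitude.group(2).lower())
    date = DATE_RE.search(ocr_text)
    if date:
        day, month, year = (int(part) for part in date.groups())
        fields["date"] = (year, month, day)
    return fields
```

<p>Values extracted in this way would still need the kind of visual verification applied to Country and Collector in this study, since OCR errors in digits are not detectable from the text alone.</p>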
<p>Some of the processes for pre-sorting herbarium specimens described here may be used in the future as part of crowd-sourcing projects. Opening up the data entry process beyond the trained digitisation staff would require the implementation of quality control checks which have not been carried out in this study.</p>
<p>Whilst we have found the quality of OCR output to be variable depending on the condition of the label, it is expected that the software will continue to improve, allowing increasing amounts of data to be extracted.</p>
</sec>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>We would like to thank the digitisation team at RBGE: Nicky Sharp, David Braidwood, Muhammad Ghazali, Lorna Glancy, Dorota Jaworska and Esther Nieto. We thank the Andrew W Mellon Foundation and OpenUp! (Opening up the Natural History Heritage for Europeana) for funding (Ref. No. 270890), Katherine O’Donnell for her help in the initial set-up of the trials, and Dr Antje Ahrends (RBGE) & Dr Chris Glaseby (BIOSS) for statistical advice. We also thank Donat Agosti for his helpful and constructive comments.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<mixed-citation publication-type="other">
<source>ABBYY Recognition Server</source>
.
<ext-link ext-link-type="uri" xlink:href="http://www.abbyy.com/recognition_server/">http://www.abbyy.com/recognition_server/</ext-link>
<comment>[accessed 27.01.2014]</comment>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Barber</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lafferty</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Landrum</surname>
<given-names>LR</given-names>
</name>
</person-group>
(
<year>2013</year>
)
<article-title>The SALIX Method: A semi-automated workflow for herbarium specimen digitization.</article-title>
<source>Taxon</source>
<volume>62</volume>
(
<issue>3</issue>
):
<fpage>581</fpage>
-
<lpage>590</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.12705/623.16">10.12705/623.16</ext-link>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="other">
<person-group>
<name>
<surname>Beaman</surname>
<given-names>RS</given-names>
</name>
<name>
<surname>Cellinese</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Heidorn</surname>
<given-names>PB</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Thiers</surname>
<given-names>B</given-names>
</name>
</person-group>
(
<year>2006</year>
)
<article-title>HERBIS: Integrating digital imaging and label data capture for herbaria</article-title>
. In:
<source>Botany 2006: Botanical Cyberinfrastructure: Issues, Challenges, Opportunities, and Initiatives</source>
.
<ext-link ext-link-type="uri" xlink:href="http://2006.botanyconference.org/engine/search/index.php?func=detail&aid=402">http://2006.botanyconference.org/engine/search/index.php?func=detail&aid=402</ext-link>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Bebber</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Carine</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Wood</surname>
<given-names>JRI</given-names>
</name>
<name>
<surname>Wortley</surname>
<given-names>AH</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Prance</surname>
<given-names>GT</given-names>
</name>
<name>
<surname>Davidse</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Paige</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Pennington</surname>
<given-names>TD</given-names>
</name>
<name>
<surname>Robson</surname>
<given-names>NKB</given-names>
</name>
<name>
<surname>Scotland</surname>
<given-names>RW</given-names>
</name>
</person-group>
(
<year>2010</year>
)
<article-title>Herbaria are a major frontier for species discovery.</article-title>
<source>PNAS</source>
<volume>107</volume>
(
<issue>51</issue>
):
<fpage>22169</fpage>
-
<lpage>22171</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.1073/pnas.1011841108">10.1073/pnas.1011841108</ext-link>
<pub-id pub-id-type="pmid">21135225</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Berendsohn</surname>
<given-names>WG</given-names>
</name>
<name>
<surname>Chavan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Macklin</surname>
<given-names>JA</given-names>
</name>
</person-group>
(
<year>2010</year>
)
<article-title>Recommendations of the GBIF task group on the global strategy and action plan for the mobilization of natural history collections data.</article-title>
<source>Biodiversity Informatics</source>
<volume>7</volume>
:
<fpage>67</fpage>
-
<lpage>71</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.1016/j.ppees.2012.10.002">10.1016/j.ppees.2012.10.002</ext-link>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="other">
<source>Biodiversity Heritage Library</source>
.
<ext-link ext-link-type="uri" xlink:href="http://www.biodiversitylibrary.org/">http://www.biodiversitylibrary.org/</ext-link>
<comment>[accessed 04.04.2014]</comment>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="other">
<person-group>
<name>
<surname>Davis</surname>
<given-names>PH</given-names>
</name>
</person-group>
(
<year>1985</year>
) Flora of Turkey and the east Aegean islands Vol. 1–9</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Elith</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Graham</surname>
<given-names>CH</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>RP</given-names>
</name>
<name>
<surname>Dudik</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ferrier</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Guisan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hijmans</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Huettmann</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Leathwick</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Lehmann</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lohmann</surname>
<given-names>LG</given-names>
</name>
<name>
<surname>Loiselle</surname>
<given-names>BA</given-names>
</name>
<name>
<surname>Manion</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Moritz</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Nakamura</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Nakazawa</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Overton</surname>
<given-names>JMcC</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>AT</given-names>
</name>
<name>
<surname>Phillips</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Richardson</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Scachetti-Pereira</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schapire</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Soberón</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wisz</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Zimmermann</surname>
<given-names>NE</given-names>
</name>
</person-group>
(
<year>2006</year>
)
<article-title>Novel methods improve prediction of species’ distributions from occurrence data.</article-title>
<source>Ecography</source>
<volume>29</volume>
:
<fpage>129</fpage>
-
<lpage>151</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.1111/j.2006.0906-7590.04596.x">10.1111/j.2006.0906-7590.04596.x</ext-link>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Hardisty</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>D</given-names>
</name>
</person-group>
and
<institution>The Biodiversity Informatics Community</institution>
(
<year>2013</year>
)
<article-title>A decadal view of biodiversity informatics: challenges and priorities.</article-title>
<source>BMC Ecology</source>
<volume>13</volume>
(
<issue>16</issue>
). doi:
<ext-link ext-link-type="doi" xlink:href="10.1186/1472-6785-13-16">10.1186/1472-6785-13-16</ext-link>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Haston</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Cubey</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>DJ</given-names>
</name>
</person-group>
(
<year>2012a</year>
)
<article-title>Data concepts and their relevance for data capture in large scale digitisation of biological collections.</article-title>
<source>International Journal of Humanities and Arts Computing</source>
<volume>6</volume>
:
<fpage>111</fpage>
-
<lpage>119</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.3366/ijhac.2012.0042">10.3366/ijhac.2012.0042</ext-link>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Haston</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Cubey</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Pullan</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Atkins</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>D</given-names>
</name>
</person-group>
(
<year>2012b</year>
)
<article-title>Developing integrated workflows for the digitisation of herbarium specimens using a modular and scalable approach.</article-title>
<source>ZooKeys</source>
<volume>209</volume>
:
<fpage>93</fpage>
-
<lpage>102</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.3897/zookeys.209.3121">10.3897/zookeys.209.3121</ext-link>
<pub-id pub-id-type="pmid">22859881</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="confproc">
<person-group>
<name>
<surname>Heidorn</surname>
<given-names>PB</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>Q</given-names>
</name>
</person-group>
(
<year>2008</year>
)
<article-title>Automatic Metadata Extraction From Museum Specimen Labels.</article-title>
<conf-name>International Conference on Dublin Core and Metadata Applications</conf-name>
:
<fpage>57</fpage>
-
<lpage>68</lpage>
<ext-link ext-link-type="uri" xlink:href="http://dcpapers.dublincore.org/pubs/article/viewFile/919/915">http://dcpapers.dublincore.org/pubs/article/viewFile/919/915</ext-link>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Hyam</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Drinkwater</surname>
<given-names>RE</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>DJ</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Stable citations for herbarium specimens on the internet: an illustration from a taxonomic revision of
<italic>Duboscia</italic>
(Malvaceae).</article-title>
<source>Phytotaxa</source>
<volume>73</volume>
:
<fpage>17</fpage>
-
<lpage>30</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.mapress.com/phytotaxa/content/2012/f/p00073p030f.pdf">http://www.mapress.com/phytotaxa/content/2012/f/p00073p030f.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="other">
<source>IrfanView</source>
.
<ext-link ext-link-type="uri" xlink:href="http://www.irfanview.com/">http://www.irfanview.com/</ext-link>
<comment>[accessed 27.01.2014]</comment>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="other">
<person-group>
<name>
<surname>Lafferty</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Landrum</surname>
<given-names>LR</given-names>
</name>
</person-group>
(
<year>2008</year>
)
<source>SALIX, the Semi-automatic Label Information Extraction system</source>
.
<ext-link ext-link-type="uri" xlink:href="http://nhc.asu.edu/vpherbarium/canotia/SALIX3.pdf">http://nhc.asu.edu/vpherbarium/canotia/SALIX3.pdf</ext-link>
<comment>[accessed 27.01.2014]</comment>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Lavoie</surname>
<given-names>C</given-names>
</name>
</person-group>
(
<year>2013</year>
)
<article-title>Biological collections in an ever changing world: Herbaria as tools for biogeographical and environmental studies.</article-title>
<source>Perspectives in Plant Ecology, Evolution and Systematics</source>
<volume>15</volume>
(
<issue>1</issue>
):
<fpage>68</fpage>
-
<lpage>76</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.1016/j.ppees.2012.10.002">10.1016/j.ppees.2012.10.002</ext-link>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Lees</surname>
<given-names>DC</given-names>
</name>
<name>
<surname>Lack</surname>
<given-names>HW</given-names>
</name>
<name>
<surname>Rougerie</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Hernandez-Lopez</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Raus</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Avtzis</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Augustin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lopez-Vaamonde</surname>
<given-names>C</given-names>
</name>
</person-group>
(
<year>2011</year>
)
<article-title>Tracking origins of invasive herbivores through herbaria and archival DNA: the case of the horse-chestnut leaf miner.</article-title>
<source>Frontiers in Ecology and the Environment</source>
<volume>9</volume>
(
<issue>6</issue>
):
<fpage>322</fpage>
-
<lpage>328</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.1890/100098">10.1890/100098</ext-link>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Miller</surname>
<given-names>AG</given-names>
</name>
</person-group>
(
<year>1996</year>
)
<source>Flora of the Arabian Peninsula and Socotra</source>
<volume>vol. 1</volume>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="other">
<person-group>
<name>
<surname>Moen</surname>
<given-names>WE</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>McCotter</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Neill</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Best</surname>
<given-names>J</given-names>
</name>
</person-group>
(
<year>2008</year>
)
<source>Extraction and Parsing of Herbarium Specimen Data: Exploring the Use of the Dublin Core Application Profile Framework</source>
.
<ext-link ext-link-type="uri" xlink:href="https://www.ideals.illinois.edu/handle/2142/14920">https://www.ideals.illinois.edu/handle/2142/14920</ext-link>
<comment>[accessed 27.01.2014]</comment>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Nelson</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Paul</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Riccardi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Mast</surname>
<given-names>AR</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Five task clusters that enable efficient and effective digitization of biological collections.</article-title>
<source>ZooKeys</source>
<volume>209</volume>
:
<fpage>19</fpage>
-
<lpage>45</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.3897/zookeys.209.3135">10.3897/zookeys.209.3135</ext-link>
<pub-id pub-id-type="pmid">22859876</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Purves</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Scharlemann</surname>
<given-names>JPW</given-names>
</name>
<name>
<surname>Harfoot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Newbold</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Tittensor</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Hutton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Emmott</surname>
<given-names>S</given-names>
</name>
</person-group>
(
<year>2013</year>
)
<article-title>Time to model all life on Earth.</article-title>
<source>Nature</source>
<volume>493</volume>
:
<fpage>295</fpage>
-
<lpage>297</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.1038/493295a">10.1038/493295a</ext-link>
<pub-id pub-id-type="pmid">23325192</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="other">
<source>Silver Biology</source>
.
<ext-link ext-link-type="uri" xlink:href="http://www.helpingscience.org/">http://www.helpingscience.org/</ext-link>
<comment>[accessed 20.11.2013]</comment>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<person-group>
<name>
<surname>Tulig</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tarnowsky</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bevans</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kirchgessner</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Thiers</surname>
<given-names>B</given-names>
</name>
</person-group>
(
<year>2012</year>
)
<article-title>Increasing the efficiency of digitization workflows for herbarium specimens.</article-title>
<source>ZooKeys</source>
<volume>209</volume>
:
<fpage>103</fpage>
-
<lpage>113</lpage>
. doi:
<ext-link ext-link-type="doi" xlink:href="10.3897/zookeys.209.3125">10.3897/zookeys.209.3125</ext-link>
<pub-id pub-id-type="pmid">22859882</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
<affiliations>
<list>
<country>
<li>Royaume-Uni</li>
</country>
</list>
<tree>
<country name="Royaume-Uni">
<noRegion>
<name sortKey="Drinkwater, Robyn E" sort="Drinkwater, Robyn E" uniqKey="Drinkwater R" first="Robyn E." last="Drinkwater">Robyn E. Drinkwater</name>
</noRegion>
<name sortKey="Cubey, Robert W N" sort="Cubey, Robert W N" uniqKey="Cubey R" first="Robert W. N." last="Cubey">Robert W. N. Cubey</name>
<name sortKey="Haston, Elspeth M" sort="Haston, Elspeth M" uniqKey="Haston E" first="Elspeth M." last="Haston">Elspeth M. Haston</name>
</country>
</tree>
</affiliations>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000034 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Checkpoint/biblio.hfd -nk 000034 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Pmc
   |étape=   Checkpoint
   |type=    RBID
   |clé=     PMC:4086207
   |texte=   The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Checkpoint/RBID.i   -Sk "pubmed:25009435" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Checkpoint/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024