OCR Exploration Server

Applications of Natural Language Processing in Biodiversity Science

Internal identifier: 000210 (Pmc/Curation); previous: 000209; next: 000211

Authors: Anne E. Thessen [United States]; Hong Cui [United States]; Dmitry Mozzherin [United States]

RBID : PMC:3364545

Abstract

Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.


DOI: 10.1155/2012/391574
PubMed: 22685456
PubMed Central: 3364545

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Applications of Natural Language Processing in Biodiversity Science</title>
<author>
<name sortKey="Thessen, Anne E" sort="Thessen, Anne E" uniqKey="Thessen A" first="Anne E." last="Thessen">Anne E. Thessen</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Cui, Hong" sort="Cui, Hong" uniqKey="Cui H" first="Hong" last="Cui">Hong Cui</name>
<affiliation wicri:level="1">
<nlm:aff id="I2">School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Mozzherin, Dmitry" sort="Mozzherin, Dmitry" uniqKey="Mozzherin D" first="Dmitry" last="Mozzherin">Dmitry Mozzherin</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22685456</idno>
<idno type="pmc">3364545</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3364545</idno>
<idno type="RBID">PMC:3364545</idno>
<idno type="doi">10.1155/2012/391574</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000210</idno>
<idno type="wicri:Area/Pmc/Curation">000210</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Applications of Natural Language Processing in Biodiversity Science</title>
<author>
<name sortKey="Thessen, Anne E" sort="Thessen, Anne E" uniqKey="Thessen A" first="Anne E." last="Thessen">Anne E. Thessen</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Cui, Hong" sort="Cui, Hong" uniqKey="Cui H" first="Hong" last="Cui">Hong Cui</name>
<affiliation wicri:level="1">
<nlm:aff id="I2">School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Mozzherin, Dmitry" sort="Mozzherin, Dmitry" uniqKey="Mozzherin D" first="Dmitry" last="Mozzherin">Dmitry Mozzherin</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Advances in Bioinformatics</title>
<idno type="ISSN">1687-8027</idno>
<idno type="eISSN">1687-8035</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science. </p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Wuethrich, B" uniqKey="Wuethrich B">B Wuethrich</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bradshaw, We" uniqKey="Bradshaw W">WE Bradshaw</name>
</author>
<author>
<name sortKey="Holzapfel, Cm" uniqKey="Holzapfel C">CM Holzapfel</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thessen, Ae" uniqKey="Thessen A">AE Thessen</name>
</author>
<author>
<name sortKey="Patterson, Dj" uniqKey="Patterson D">DJ Patterson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hey, A" uniqKey="Hey A">A Hey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stein, Ld" uniqKey="Stein L">LD Stein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heidorn, Pb" uniqKey="Heidorn P">PB Heidorn</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vollmar, A" uniqKey="Vollmar A">A Vollmar</name>
</author>
<author>
<name sortKey="Macklin, Ja" uniqKey="Macklin J">JA Macklin</name>
</author>
<author>
<name sortKey="Ford, L" uniqKey="Ford L">L Ford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schofield, Pn" uniqKey="Schofield P">PN Schofield</name>
</author>
<author>
<name sortKey="Eppig, J" uniqKey="Eppig J">J Eppig</name>
</author>
<author>
<name sortKey="Huala, E" uniqKey="Huala E">E Huala</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Groth, P" uniqKey="Groth P">P Groth</name>
</author>
<author>
<name sortKey="Gibson, A" uniqKey="Gibson A">A Gibson</name>
</author>
<author>
<name sortKey="Velterop, J" uniqKey="Velterop J">J Velterop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kalfatovic, M" uniqKey="Kalfatovic M">M Kalfatovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tang, X" uniqKey="Tang X">X Tang</name>
</author>
<author>
<name sortKey="Heidorn, P" uniqKey="Heidorn P">P Heidorn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
<author>
<name sortKey="Selden, P" uniqKey="Selden P">P Selden</name>
</author>
<author>
<name sortKey="Boufford, D" uniqKey="Boufford D">D Boufford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Taylor, A" uniqKey="Taylor A">A Taylor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miyao, Y" uniqKey="Miyao Y">Y Miyao</name>
</author>
<author>
<name sortKey="Sagae, K" uniqKey="Sagae K">K Sagae</name>
</author>
<author>
<name sortKey="S Tre, R" uniqKey="S Tre R">R Sætre</name>
</author>
<author>
<name sortKey="Matsuzaki, T" uniqKey="Matsuzaki T">T Matsuzaki</name>
</author>
<author>
<name sortKey="Tsujii, J" uniqKey="Tsujii J">J Tsujii</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Humphreys, K" uniqKey="Humphreys K">K Humphreys</name>
</author>
<author>
<name sortKey="Demetriou, G" uniqKey="Demetriou G">G Demetriou</name>
</author>
<author>
<name sortKey="Gaizauskas, R" uniqKey="Gaizauskas R">R Gaizauskas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gaizauskas, R" uniqKey="Gaizauskas R">R Gaizauskas</name>
</author>
<author>
<name sortKey="Demetriou, G" uniqKey="Demetriou G">G Demetriou</name>
</author>
<author>
<name sortKey="Artymiuk, Pj" uniqKey="Artymiuk P">PJ Artymiuk</name>
</author>
<author>
<name sortKey="Willett, P" uniqKey="Willett P">P Willett</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Divoli, A" uniqKey="Divoli A">A Divoli</name>
</author>
<author>
<name sortKey="Attwood, Tk" uniqKey="Attwood T">TK Attwood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Corney, Dpa" uniqKey="Corney D">DPA Corney</name>
</author>
<author>
<name sortKey="Buxton, Bf" uniqKey="Buxton B">BF Buxton</name>
</author>
<author>
<name sortKey="Langdon, Wb" uniqKey="Langdon W">WB Langdon</name>
</author>
<author>
<name sortKey="Jones, Dt" uniqKey="Jones D">DT Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, H" uniqKey="Chen H">H Chen</name>
</author>
<author>
<name sortKey="Sharp, Bm" uniqKey="Sharp B">BM Sharp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhou, X" uniqKey="Zhou X">X Zhou</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X Zhang</name>
</author>
<author>
<name sortKey="Hu, X" uniqKey="Hu X">X Hu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rebholz Schuhmann, D" uniqKey="Rebholz Schuhmann D">D Rebholz-Schuhmann</name>
</author>
<author>
<name sortKey="Kirsch, H" uniqKey="Kirsch H">H Kirsch</name>
</author>
<author>
<name sortKey="Arregui, M" uniqKey="Arregui M">M Arregui</name>
</author>
<author>
<name sortKey="Gaudan, S" uniqKey="Gaudan S">S Gaudan</name>
</author>
<author>
<name sortKey="Riethoven, M" uniqKey="Riethoven M">M Riethoven</name>
</author>
<author>
<name sortKey="Stoehr, P" uniqKey="Stoehr P">P Stoehr</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hu, Zz" uniqKey="Hu Z">ZZ Hu</name>
</author>
<author>
<name sortKey="Mani, I" uniqKey="Mani I">I Mani</name>
</author>
<author>
<name sortKey="Hermoso, V" uniqKey="Hermoso V">V Hermoso</name>
</author>
<author>
<name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
<author>
<name sortKey="Wu, Ch" uniqKey="Wu C">CH Wu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Demaine, J" uniqKey="Demaine J">J Demaine</name>
</author>
<author>
<name sortKey="Martin, J" uniqKey="Martin J">J Martin</name>
</author>
<author>
<name sortKey="Wei, L" uniqKey="Wei L">L Wei</name>
</author>
<author>
<name sortKey="De Bruijn, B" uniqKey="De Bruijn B">B De Bruijn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lease, M" uniqKey="Lease M">M Lease</name>
</author>
<author>
<name sortKey="Charniak, E" uniqKey="Charniak E">E Charniak</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pyysalo, S" uniqKey="Pyysalo S">S Pyysalo</name>
</author>
<author>
<name sortKey="Salakoski, T" uniqKey="Salakoski T">T Salakoski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rimell, L" uniqKey="Rimell L">L Rimell</name>
</author>
<author>
<name sortKey="Clark, S" uniqKey="Clark S">S Clark</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koning, D" uniqKey="Koning D">D Koning</name>
</author>
<author>
<name sortKey="Sarkar, In" uniqKey="Sarkar I">IN Sarkar</name>
</author>
<author>
<name sortKey="Moritz, T" uniqKey="Moritz T">T Moritz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Akella, Lm" uniqKey="Akella L">LM Akella</name>
</author>
<author>
<name sortKey="Norton, Cn" uniqKey="Norton C">CN Norton</name>
</author>
<author>
<name sortKey="Miller, H" uniqKey="Miller H">H Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gerner, M" uniqKey="Gerner M">M Gerner</name>
</author>
<author>
<name sortKey="Nenadic, G" uniqKey="Nenadic G">G Nenadic</name>
</author>
<author>
<name sortKey="Bergman, Cm" uniqKey="Bergman C">CM Bergman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Naderi, N" uniqKey="Naderi N">N Naderi</name>
</author>
<author>
<name sortKey="Kappler, T" uniqKey="Kappler T">T Kappler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abascal, R" uniqKey="Abascal R">R Abascal</name>
</author>
<author>
<name sortKey="Sanchez, Ja" uniqKey="Sanchez J">JA Sánchez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krauthammer, M" uniqKey="Krauthammer M">M Krauthammer</name>
</author>
<author>
<name sortKey="Rzhetsky, A" uniqKey="Rzhetsky A">A Rzhetsky</name>
</author>
<author>
<name sortKey="Morozov, P" uniqKey="Morozov P">P Morozov</name>
</author>
<author>
<name sortKey="Friedman, C" uniqKey="Friedman C">C Friedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lenzi, L" uniqKey="Lenzi L">L Lenzi</name>
</author>
<author>
<name sortKey="Frabetti, F" uniqKey="Frabetti F">F Frabetti</name>
</author>
<author>
<name sortKey="Facchin, F" uniqKey="Facchin F">F Facchin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nasr, A" uniqKey="Nasr A">A Nasr</name>
</author>
<author>
<name sortKey="Rambow, O" uniqKey="Rambow O">O Rambow</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leaman, R" uniqKey="Leaman R">R Leaman</name>
</author>
<author>
<name sortKey="Gonzalez, G" uniqKey="Gonzalez G">G Gonzalez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schroder, M" uniqKey="Schroder M">M Schröder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Witten, Ih" uniqKey="Witten I">IH Witten</name>
</author>
<author>
<name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blaschke, C" uniqKey="Blaschke C">C Blaschke</name>
</author>
<author>
<name sortKey="Hirschman, L" uniqKey="Hirschman L">L Hirschman</name>
</author>
<author>
<name sortKey="Valencia, A" uniqKey="Valencia A">A Valencia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jimeno Yepes, A" uniqKey="Jimeno Yepes A">A Jimeno-Yepes</name>
</author>
<author>
<name sortKey="Aronson, Ar" uniqKey="Aronson A">AR Aronson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freeland, C" uniqKey="Freeland C">C Freeland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kornai, A" uniqKey="Kornai A">A Kornai</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kornai, A" uniqKey="Kornai A">A Kornai</name>
</author>
<author>
<name sortKey="Mohiuddin, K" uniqKey="Mohiuddin K">K Mohiuddin</name>
</author>
<author>
<name sortKey="Connell, Sd" uniqKey="Connell S">SD Connell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freeland, C" uniqKey="Freeland C">C Freeland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Willis, A" uniqKey="Willis A">A Willis</name>
</author>
<author>
<name sortKey="King, D" uniqKey="King D">D King</name>
</author>
<author>
<name sortKey="Morse, D" uniqKey="Morse D">D Morse</name>
</author>
<author>
<name sortKey="Dil, A" uniqKey="Dil A">A Dil</name>
</author>
<author>
<name sortKey="Lyal, C" uniqKey="Lyal C">C Lyal</name>
</author>
<author>
<name sortKey="Roberts, D" uniqKey="Roberts D">D Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bapst, F" uniqKey="Bapst F">F Bapst</name>
</author>
<author>
<name sortKey="Ingold, R" uniqKey="Ingold R">R Ingold</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weitzman, Al" uniqKey="Weitzman A">AL Weitzman</name>
</author>
<author>
<name sortKey="Lyal, Chc" uniqKey="Lyal C">CHC Lyal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rees, T" uniqKey="Rees T">T Rees</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sautter, G" uniqKey="Sautter G">G Sautter</name>
</author>
<author>
<name sortKey="Bohm, K" uniqKey="Bohm K">K Böhm</name>
</author>
<author>
<name sortKey="Agosti, D" uniqKey="Agosti D">D Agosti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Settles, B" uniqKey="Settles B">B Settles</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pavlopoulos, Ga" uniqKey="Pavlopoulos G">GA Pavlopoulos</name>
</author>
<author>
<name sortKey="Pafilis, E" uniqKey="Pafilis E">E Pafilis</name>
</author>
<author>
<name sortKey="Kuhn, M" uniqKey="Kuhn M">M Kuhn</name>
</author>
<author>
<name sortKey="Hooper, Sd" uniqKey="Hooper S">SD Hooper</name>
</author>
<author>
<name sortKey="Schneider, R" uniqKey="Schneider R">R Schneider</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pafilis, E" uniqKey="Pafilis E">E Pafilis</name>
</author>
<author>
<name sortKey="O Onoghue, Si" uniqKey="O Onoghue S">SI O’Donoghue</name>
</author>
<author>
<name sortKey="Jensen, Lj" uniqKey="Jensen L">LJ Jensen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuhn, M" uniqKey="Kuhn M">M Kuhn</name>
</author>
<author>
<name sortKey="Von Mering, C" uniqKey="Von Mering C">C von Mering</name>
</author>
<author>
<name sortKey="Campillos, M" uniqKey="Campillos M">M Campillos</name>
</author>
<author>
<name sortKey="Jensen, Lj" uniqKey="Jensen L">LJ Jensen</name>
</author>
<author>
<name sortKey="Bork, P" uniqKey="Bork P">P Bork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Balhoff, Jp" uniqKey="Balhoff J">JP Balhoff</name>
</author>
<author>
<name sortKey="Dahdul, Wm" uniqKey="Dahdul W">WM Dahdul</name>
</author>
<author>
<name sortKey="Kothari, Cr" uniqKey="Kothari C">CR Kothari</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dahdul, Wm" uniqKey="Dahdul W">WM Dahdul</name>
</author>
<author>
<name sortKey="Balhoff, Jp" uniqKey="Balhoff J">JP Balhoff</name>
</author>
<author>
<name sortKey="Engeman, J" uniqKey="Engeman J">J Engeman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sautter, G" uniqKey="Sautter G">G Sautter</name>
</author>
<author>
<name sortKey="Bohm, K" uniqKey="Bohm K">K Bohm</name>
</author>
<author>
<name sortKey="Agosti, D" uniqKey="Agosti D">D Agosti</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leary, Pr" uniqKey="Leary P">PR Leary</name>
</author>
<author>
<name sortKey="Remsen, Dp" uniqKey="Remsen D">DP Remsen</name>
</author>
<author>
<name sortKey="Norton, Cn" uniqKey="Norton C">CN Norton</name>
</author>
<author>
<name sortKey="Patterson, Dj" uniqKey="Patterson D">DJ Patterson</name>
</author>
<author>
<name sortKey="Sarkar, In" uniqKey="Sarkar I">IN Sarkar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Okazaki, N" uniqKey="Okazaki N">N Okazaki</name>
</author>
<author>
<name sortKey="Ananiadou, S" uniqKey="Ananiadou S">S Ananiadou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bontcheva, K" uniqKey="Bontcheva K">K Bontcheva</name>
</author>
<author>
<name sortKey="Tablan, V" uniqKey="Tablan V">V Tablan</name>
</author>
<author>
<name sortKey="Maynard, D" uniqKey="Maynard D">D Maynard</name>
</author>
<author>
<name sortKey="Cunningham, H" uniqKey="Cunningham H">H Cunningham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cunningham, H" uniqKey="Cunningham H">H Cunningham</name>
</author>
<author>
<name sortKey="Maynard, D" uniqKey="Maynard D">D Maynard</name>
</author>
<author>
<name sortKey="Bontcheva, K" uniqKey="Bontcheva K">K Bontcheva</name>
</author>
<author>
<name sortKey="Tablan, V" uniqKey="Tablan V">V Tablan</name>
</author>
<author>
<name sortKey="Ursu, C" uniqKey="Ursu C">C Ursu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fitzpatrick, E" uniqKey="Fitzpatrick E">E Fitzpatrick</name>
</author>
<author>
<name sortKey="Bachenko, J" uniqKey="Bachenko J">J Bachenko</name>
</author>
<author>
<name sortKey="Hindle, D" uniqKey="Hindle D">D Hindle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, M" uniqKey="Wood M">M Wood</name>
</author>
<author>
<name sortKey="Lydon, S" uniqKey="Lydon S">S Lydon</name>
</author>
<author>
<name sortKey="Tablan, V" uniqKey="Tablan V">V Tablan</name>
</author>
<author>
<name sortKey="Maynard, D" uniqKey="Maynard D">D Maynard</name>
</author>
<author>
<name sortKey="Cunningham, H" uniqKey="Cunningham H">H Cunningham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, L" uniqKey="Chen L">L Chen</name>
</author>
<author>
<name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
<author>
<name sortKey="Friedman, C" uniqKey="Friedman C">C Friedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, H" uniqKey="Yu H">H Yu</name>
</author>
<author>
<name sortKey="Kim, W" uniqKey="Kim W">W Kim</name>
</author>
<author>
<name sortKey="Hatzivassiloglou, V" uniqKey="Hatzivassiloglou V">V Hatzivassiloglou</name>
</author>
<author>
<name sortKey="Wilbur, Wj" uniqKey="Wilbur W">WJ Wilbur</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chang, Jt" uniqKey="Chang J">JT Chang</name>
</author>
<author>
<name sortKey="Schutze, H" uniqKey="Schutze H">H Schutze</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wren, Jd" uniqKey="Wren J">JD Wren</name>
</author>
<author>
<name sortKey="Garner, Hr" uniqKey="Garner H">HR Garner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lydon, S" uniqKey="Lydon S">S Lydon</name>
</author>
<author>
<name sortKey="Wood, M" uniqKey="Wood M">M Wood</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Taylor, A" uniqKey="Taylor A">A Taylor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Radford, Ae" uniqKey="Radford A">AE Radford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Diederich, J" uniqKey="Diederich J">J Diederich</name>
</author>
<author>
<name sortKey="Fortuner, R" uniqKey="Fortuner R">R Fortuner</name>
</author>
<author>
<name sortKey="Milton, J" uniqKey="Milton J">J Milton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, M" uniqKey="Wood M">M Wood</name>
</author>
<author>
<name sortKey="Lydon, S" uniqKey="Lydon S">S Lydon</name>
</author>
<author>
<name sortKey="Tablan, V" uniqKey="Tablan V">V Tablan</name>
</author>
<author>
<name sortKey="Maynard, D" uniqKey="Maynard D">D Maynard</name>
</author>
<author>
<name sortKey="Cunningham, H" uniqKey="Cunningham H">H Cunningham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
<author>
<name sortKey="Heidorn, Pb" uniqKey="Heidorn P">PB Heidorn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, Q" uniqKey="Wei Q">Q Wei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Soderland, S" uniqKey="Soderland S">S Soderland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, H" uniqKey="Cui H">H Cui</name>
</author>
<author>
<name sortKey="Singaram, S" uniqKey="Singaram S">S Singaram</name>
</author>
<author>
<name sortKey="Janning, A" uniqKey="Janning A">A Janning</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mabee, Pm" uniqKey="Mabee P">PM Mabee</name>
</author>
<author>
<name sortKey="Ashburner, M" uniqKey="Ashburner M">M Ashburner</name>
</author>
<author>
<name sortKey="Cronk, Q" uniqKey="Cronk Q">Q Cronk</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="review-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Adv Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Adv Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">ABI</journal-id>
<journal-title-group>
<journal-title>Advances in Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1687-8027</issn>
<issn pub-type="epub">1687-8035</issn>
<publisher>
<publisher-name>Hindawi Publishing Corporation</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22685456</article-id>
<article-id pub-id-type="pmc">3364545</article-id>
<article-id pub-id-type="doi">10.1155/2012/391574</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Review Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Applications of Natural Language Processing in Biodiversity Science</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Thessen</surname>
<given-names>Anne E.</given-names>
</name>
<xref ref-type="aff" rid="I1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Cui</surname>
<given-names>Hong</given-names>
</name>
<xref ref-type="aff" rid="I2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Mozzherin</surname>
<given-names>Dmitry</given-names>
</name>
<xref ref-type="aff" rid="I1">
<sup>1</sup>
</xref>
</contrib>
</contrib-group>
<aff id="I1">
<sup>1</sup>
Center for Library and Informatics, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA</aff>
<aff id="I2">
<sup>2</sup>
School of Information Resources and Library Science, University of Arizona, Tucson, AZ 85719, USA</aff>
<author-notes>
<corresp id="cor1">*Anne E. Thessen:
<email>athessen@mbl.edu</email>
</corresp>
<fn fn-type="other">
<p>Academic Editor: Jörg Hakenberg</p>
</fn>
</author-notes>
<pub-date pub-type="ppub">
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>22</day>
<month>5</month>
<year>2012</year>
</pub-date>
<volume>2012</volume>
<elocation-id>391574</elocation-id>
<history>
<date date-type="received">
<day>4</day>
<month>11</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>2</month>
<year>2012</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2012 Anne E. Thessen et al.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="open-access">
<license-p>This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science. </p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>1. Introduction</title>
<p> Biologists are expected to answer large-scale questions that address processes occurring across broad spatial and temporal scales, such as the effects of climate change on species [
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B2">2</xref>
]. This motivates the development of a new type of data-driven discovery focusing on scientific insights and hypothesis generation through the novel management and analysis of preexisting data [
<xref ref-type="bibr" rid="B3">3</xref>
,
<xref ref-type="bibr" rid="B4">4</xref>
]. Data-driven discovery presumes that a large, virtual pool of data will emerge across a wide spectrum of the life sciences, matching that already in place for the molecular sciences. It is argued that the availability of such a pool will allow biodiversity science to join the other “Big” (i.e., data-centric) sciences such as astronomy and high-energy particle physics [
<xref ref-type="bibr" rid="B5">5</xref>
]. Managing large amounts of heterogeneous data for this Big New Biology will require a cyberinfrastructure that organizes an open pool of biological data [
<xref ref-type="bibr" rid="B6">6</xref>
].</p>
<p>To assess the resources needed to establish the cyberinfrastructure for biology, it is necessary to understand the nature of biological data [
<xref ref-type="bibr" rid="B4">4</xref>
]. To become a part of the cyberinfrastructure, data must be ready to enter a digital data pool. This means data must be digital, normalized, and standardized [
<xref ref-type="bibr" rid="B4">4</xref>
]. Biological data sets are heterogeneous in format, size, degree of digitization, and openness [
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B7">7</xref>
,
<xref ref-type="bibr" rid="B8">8</xref>
]. The distribution of data packages in biology can be represented as a hollow curve [
<xref ref-type="bibr" rid="B7">7</xref>
] (
<xref ref-type="fig" rid="fig1">Figure 1</xref>
). To the left of the curve are the few providers producing large amounts of data, often derived from instruments and born digital such as in molecular biology. To the right of the curve are the many providers producing small amounts of data. It is estimated that 80% of scientific output comes from these small providers [
<xref ref-type="bibr" rid="B7">7</xref>
]. Generally called “small science,” these data are rarely preserved [
<xref ref-type="bibr" rid="B9">9</xref>
,
<xref ref-type="bibr" rid="B10">10</xref>
]. Scientific publication, a narrative explanation derived from primary data, is often the only lasting record of this work.</p>
<p>The complete body of research literature is a major container for much of our knowledge about the natural world and represents centuries of investment. The value of this information is high as it reflects observations that are difficult to replace if they are replaceable at all [
<xref ref-type="bibr" rid="B7">7</xref>
]. Much of the information has high relevance today, such as records on the historical occurrence of species that will help us better understand shifting abundances and distributions. Similarly, taxonomy, with its need to respect all nomenclatural acts back to the 1750s, needs to have access to information contained exclusively within this body of literature. Unfortunately, this knowledge has been presented in narrative prose, so careful reading and annotation are required to make use of any information [
<xref ref-type="bibr" rid="B11">11</xref>
] and only a subset has been migrated into digital form.</p>
<p>The historical biodiversity literature is estimated to comprise hundreds of millions of pages [
<xref ref-type="bibr" rid="B12">12</xref>
]. Currently, over 33 million pages of legacy biology text are scanned and made available online through the Biodiversity Heritage Library (
<ext-link ext-link-type="uri" xlink:href="http://www.biodiversitylibrary.org/">http://www.biodiversitylibrary.org/</ext-link>
) and thousands of new digital pages are published every month in open-access biology journals (estimated based on 216 journals publishing approximately 10 articles per month of fewer than 10 pages each;
<ext-link ext-link-type="uri" xlink:href="http://www.doaj.org/doaj?cpid=67&amp;func=subject">http://www.doaj.org/doaj?cpid=67&amp;func=subject</ext-link>
). Additional biologically focused digital literature repositories can be found here (
<ext-link ext-link-type="uri" xlink:href="http://www.library.illinois.edu/nhx/resources/digitalresourcecatalogs.html">http://www.library.illinois.edu/nhx/resources/digitalresourcecatalogs.html</ext-link>
).</p>
<p>The information is in human-readable form but is too much for a person to transform into a digital data pool. Machines can better handle the volume, but cannot determine which elements of the text have value. In order to mobilize the valuable content in the literature, we need innovative algorithms to translate the entirety of the biological literature into a machine-readable form, extract the information with value, and feed it in a standards-compliant form into an open data pool. This paper discusses the application of natural language processing algorithms to biodiversity science to enable data-driven discovery.</p>
</sec>
<sec id="sec2">
<title>2. Overview</title>
<sec sec-type="subsection" id="sec2.1">
<title>2.1. Information Extraction</title>
<p>Research addressing the transformation of natural language text into a digital data pool is generally labeled as “information extraction” (IE). An IE task typically involves a corpus of source text documents to be acted upon by the IE algorithm and an extraction template that describes what will be extracted. For a plant character IE task, (e.g., [
<xref ref-type="bibr" rid="B13">13</xref>
]), a template may consist of taxon name, leaf shape, leaf size, leaf arrangement, and so forth (
<xref ref-type="table" rid="tab1">Table 1</xref>
). The characteristics of the source documents and the complexity of the template determine the difficulty level of an IE task. More complex IE tasks are often broken down into a series (stack) of subtasks, with a later subtask often relying on the success of an earlier one.
<xref ref-type="table" rid="tab2"> Table 2</xref>
illustrates typical subtasks involved in an IE task. Note that not all IE tasks involve all of these subtasks. Examples of information extraction tools for biology (not including biodiversity science) can be found in
<xref ref-type="table" rid="tab3">Table 3</xref>
.</p>
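<p>An extraction template of this kind can be pictured as a record with one slot per target. The following is a minimal sketch in Python, assuming only the four slots named in the text; the class name, field names, and example values are hypothetical and are not taken from Table 1.</p>
<preformat>
```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LeafTemplate:
    """Hypothetical extraction template: one slot per item to extract."""
    taxon_name: Optional[str] = None
    leaf_shape: Optional[str] = None
    leaf_size: Optional[str] = None
    leaf_arrangement: Optional[str] = None

    def unfilled_slots(self):
        """Return the names of slots that no extraction step has filled yet."""
        return [name for name, value in asdict(self).items() if value is None]


# Illustrative values only; an IE system would fill the slots from source text.
record = LeafTemplate(taxon_name="Example taxon", leaf_shape="obovate")
print(record.unfilled_slots())  # prints ['leaf_size', 'leaf_arrangement']
```
</preformat>
<p>Tracking unfilled slots in this way makes it easy to see which subtasks of a stacked IE task still have to succeed before a record is complete.</p>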
<p>The IE field has made rapid progress since the 1980s with the Message Understanding Conferences (MUCs) and has become very active since the 1990s due largely to the development of the World Wide Web. The Web has made available huge numbers of text documents and human-prepared datasets (e.g., categorized web pages, databases) in an electronic format. Both can readily be used to evaluate the performance of an IE system. The massive production of digital information demands more efficient, computer-aided approaches to process, organize, and access the information. The urgent need to extract interesting information from large amounts of text to support knowledge discovery was recognized as an application for IE tools (e.g., identifying possible terrorists or terrorist attacks by extracting information from a large number of email messages). For this reason, IE and other related research have acquired another, more general label “text data mining” (or simply “text mining”).</p>
<p>Information extraction algorithms are regularly evaluated based on three metrics: recall, precision, and the F score. Consider an algorithm trained to extract names of species from documents being run against a document containing the words: cat, dog, chicken, horse, goat, and cow. Recall is the ratio of the number of “species words” correctly extracted to the number present in the document (6). So, an algorithm that only recognized cat and dog would have low recall (33%). Precision is the percentage of what the algorithm extracts that is correct. Since both cat and dog are species words, the precision of our algorithm would be 100% despite having a low recall. If the algorithm extracted all of the species words from the document and nothing else, it would have both high precision and high recall, but if it also extracted other words that are not species, it would have low precision and high recall. The
<italic>F</italic>
score is an overall metric calculated from precision and recall when precision and recall are considered equally important:</p>
<p>
<disp-formula id="eq1">
<label>(1)</label>
<mml:math id="M1">
<mml:mi>F</mml:mi>
<mml:mi>  </mml:mi>
<mml:mtext>score</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mtext>precision</mml:mtext>
<mml:mi>  </mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>  </mml:mi>
<mml:mtext>recall</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mtext>precision</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>recall</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
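<p>The three metrics can be checked against the cat and dog example with a few lines of Python. This is a minimal sketch, not part of any reviewed system; the function name and the use of sets are our own assumptions.</p>
<preformat>
```python
def precision_recall_f(extracted, relevant):
    """Return (precision, recall, F score) for extracted versus relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted.intersection(relevant)
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F score with precision and recall weighted equally
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


species_in_document = ["cat", "dog", "chicken", "horse", "goat", "cow"]
p, r, f = precision_recall_f(["cat", "dog"], species_in_document)
print(round(p, 2), round(r, 2), round(f, 2))  # prints 1.0 0.33 0.5
```
</preformat>
<p>An algorithm that extracts only cat and dog therefore scores 100% precision, 33% recall, and an F score of 0.5.</p>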
<p>Before we review current IE systems for biodiversity science, we will first present a reference system architecture for a model IE system that covers the entire process of an IE application (
<xref ref-type="fig" rid="fig2">Figure 2</xref>
). In reviewing variant systems, we will refer to this reference architecture.</p>
<p>The blue-shaded areas in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
illustrate an IE system. The inputs to the IE system include source documents in a digital format (element number 1 in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
), an IE template which describes the IE task (2) and knowledge entities to perform the task (3). If documents are not in a digital format, OCR technologies can be used to make the transition (4; see below section on digitization), but then it is necessary to correct OCR errors before use (5). In this model system, we use “IE template” to refer not only to those that are well defined such as the leaf character template example in
<xref ref-type="table" rid="tab1">Table 1</xref>
, but also those more loosely defined. For example, we also consider lists of names and characters to be IE templates so the reference system can cover Named Entity Recognition systems (see below for examples) and character annotation systems (see below for examples). Knowledge entities include, for example, dictionaries, glossaries, gazetteers, or ontologies (3). The output of an IE system is often data in a structured format, illustrated as a database in the diagram (6). Ideally, the structured format conforms to one of many data standards (7), which can range from relational database schemas to RDF. The arrow from Knowledge Entities to Extracted Data illustrates that, in some cases, the extracted data can be better interpreted with the support of knowledge entities (as in annotation projects such as Phenoscape,
<ext-link ext-link-type="uri" xlink:href="http://phenoscape.org/wiki/Main_Page">http://phenoscape.org/wiki/Main_Page</ext-link>
). The arrow from Data Standards to Extracted Data suggests the same.</p>
<p>NLP techniques are often used in combination with extraction methods (including hand-crafted rules and/or machine learning methods). Often the input documents contain text that is not relevant to an IE task [
<xref ref-type="bibr" rid="B14">14</xref>
]. In these cases, the blocks of text that contain extraction targets need to be identified and extracted first to avoid the waste of computational resources (8). An IE method is often used for this purpose (9). From the selected text, a series of tasks may be performed to extract target information (10) and produce final output (6; see also IE subtasks in
<xref ref-type="table" rid="tab2">Table 2</xref>
). This is often accomplished by first applying NLP techniques (11) and then using one or a combination of extraction methods (9). The arrow from extraction methods to NLP tools in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
indicates that machine learning and hand-crafted rules can be used to adapt/improve NLP tools for an IE task by, for example, extracting domain terms to extend the lexicon (12) used by a syntactic parser or even create a special purpose parser [
<xref ref-type="bibr" rid="B15">15</xref>
]. One important element that is not included in the model (
<xref ref-type="fig" rid="fig2">Figure 2</xref>
) is the human curation component. This is important for expert confirmation that extraction results are correct.</p>
</sec>
<sec sec-type="subsection" id="sec2.2">
<title>2.2. Natural Language Processing</title>
<p>IE is an area of application of natural language processing (NLP). NLP enables a computer to read (and possibly “understand”) information from natural language texts such as publications. NLP consists of a stack of techniques of increasing sophistication to progressively interpret language, starting with words, progressing to sentence structure (syntax or syntactic parsing), and ending at sentence meaning (semantics or semantic parsing) and meaning within sequences of sentences (discourse analysis). Typically an NLP technique higher in the stack (discourse analysis) utilizes the techniques below it (syntactic parsing). A variety of NLP techniques have been used in IE applications, but most only progress to syntactic parsing (some special IE applications specifically mentioned in this paper may not use any of the techniques). More sophisticated techniques higher in the stack (semantic parsing and discourse analysis) are rarely used in IE applications because they are highly specialized (that is, they cannot be reliably applied in general applications) and are more computationally expensive.</p>
<p>Syntactic parsing can be shallow or deep. Shallow syntactic parsing (also called “chunking”) typically identifies noun, verb, preposition phrases, and so forth in a sentence (
<xref ref-type="fig" rid="fig3">Figure 3</xref>
), while deep syntactic parsing produces full parse trees, in which the syntactic function (e.g., Part of Speech, or POS) of each word or phrase is tagged with a short label (
<xref ref-type="fig" rid="fig4">Figure 4</xref>
). The most commonly used set of POS tags is the Penn Treebank Tag Set (
<ext-link ext-link-type="uri" xlink:href="http://bulba.sdsu.edu/jeanette/thesis/PennTags.html">http://bulba.sdsu.edu/jeanette/thesis/PennTags.html</ext-link>
), which has labels for parts of speech and phrase types such as adjective phrases (ADJP), plural nouns (NNS), and so forth. Not all shallow parsers identify the same set of phrases. GENIA Tagger, for example, identifies adjective phrases (ADJP), adverb phrases (ADVP), conjunctive phrases (CONJP), interjections (INTJ), list markers (LST), noun phrases (NP), prepositional phrases (PP), particles (PRT), subordinate clauses (SBAR), and verb phrases (VP). Some shallow parsing tools are the Illinois Shallow Parser (
<ext-link ext-link-type="uri" xlink:href="http://cogcomp.cs.illinois.edu/page/software_view/13">http://cogcomp.cs.illinois.edu/page/software_view/13</ext-link>
), Apache OpenNLP (
<ext-link ext-link-type="uri" xlink:href="http://incubator.apache.org/opennlp/index.html">http://incubator.apache.org/opennlp/index.html</ext-link>
), and GENIA Tagger (
<ext-link ext-link-type="uri" xlink:href="http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/">http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/</ext-link>
). Deep parsing tools include Stanford Parser (
<ext-link ext-link-type="uri" xlink:href="http://nlp.stanford.edu/software/lex-parser.shtml">http://nlp.stanford.edu/software/lex-parser.shtml</ext-link>
), Link Parser (
<ext-link ext-link-type="uri" xlink:href="http://www.link.cs.cmu.edu/link/">http://www.link.cs.cmu.edu/link/</ext-link>
) and Enju Parser (
<ext-link ext-link-type="uri" xlink:href="http://www-tsujii.is.s.u-tokyo.ac.jp/enju/">http://www-tsujii.is.s.u-tokyo.ac.jp/enju/</ext-link>
). A majority of IE applications use the shallow parsing technique, but the use of deep parsing techniques is on the rise in biology applications. This is driven in part by the inadequacy of shallow parsing for extracting information from biology text [
<xref ref-type="bibr" rid="B16">27</xref>
–
<xref ref-type="bibr" rid="B18">29</xref>
].</p>
<p>Several NLP approaches are available for IE applications in biology that go beyond shallow parsing and are not mutually exclusive.</p>
<list list-type="order">
<list-item>
<p>
<italic>Pattern matching</italic>
approaches exploit basic patterns in text to extract information. An example pattern is “enzyme activates protein” or X activates Y. The computer would look for the specific text pattern and assume that all X are enzymes or all Y are proteins. Dictionary-based IE is a variant of pattern matching that focuses on finding words in text that are contained in a dictionary previously given to the computer. For example, the computer might be given a list of enzyme names (such as the UM-BBD list of enzyme names,
<ext-link ext-link-type="uri" xlink:href="http://umbbd.msi.umn.edu/servlets/pageservlet?ptype=allenzymes">http://umbbd.msi.umn.edu/servlets/pageservlet?ptype=allenzymes</ext-link>
; X in previous example). Once the enzyme name is located, the computer can infer the pattern that it “activates Y.” Another variant of pattern matching is preposition-based parsing, which focuses on finding prepositions like “by” and “of” and filling a basic template with information surrounding that preposition. An example of this would be “Y is activated by X.” Pattern matching has difficulty accounting for the wide array of linguistic patterns used in text (X activates Y, Y is activated by X, Y was activated by X, X activated Y, Y is activated via X, X, which activates Y, etc.). Many of these systems extract phrases or sentences instead of structured facts, which limits their usefulness for further informatics. An example system that uses pattern matching is given in Krauthammer et al. [
<xref ref-type="bibr" rid="B19">37</xref>
].</p>
</list-item>
<list-item>
<p>
<italic>Full parsing</italic>
approaches expand on shallow parsing to include an analysis of sentence structure (i.e., syntax, see
<xref ref-type="fig" rid="fig3">Figure 3</xref>
). The biggest challenge with this approach is the special language of biology-specific texts. Most existing full-parsing systems are designed to handle general language texts, like news articles. The approach is also limited by grammar mistakes in the literature, which are often due to nonnative English speakers. Full parsing often runs into ambiguity due to the many ways a sentence (even moderately complex) can be interpreted by a machine. Sentence fragments, such as titles or captions, can also cause problems. UniGene Tabulator is an example of a full parser for biology [
<xref ref-type="bibr" rid="B20">38</xref>
].</p>
</list-item>
<list-item>
<p>
<italic>Probability-based</italic>
approaches offer a solution to the linguistic variability that confounds full parsing. These approaches use weighted grammar rules to decrease sensitivity to variation. The weights are assigned through processing of a large body of manually annotated text. Probabilistic grammars are used to estimate the probability that a particular parse tree will be correct or the probability that a sentence or sentence fragment has been recognized correctly. Results can be ranked according to the probabilities. Nasr and Rambow give an example of a probability-based parser [
<xref ref-type="bibr" rid="B21">39</xref>
].</p>
</list-item>
<list-item>
<p>
<italic>Mixed syntactic-semantic</italic>
approaches take advantage of syntactic and semantic knowledge together. This essentially combines part-of-speech taggers with named-entity recognition, such as in the BANNER system [
<xref ref-type="bibr" rid="B22">40</xref>
]. This removes reliance on lexicons and templates. This approach will be discussed further below.</p>
</list-item>
<list-item>
<p>
<italic>Sublanguage-driven</italic>
approaches use the specialized language of a specific community. A specialized sublanguage typically has a set of constraints that determine vocabulary, composition, and syntax that can be translated into a set of rules for an algorithm. Algorithms for use in processing biology text must cope with specialized language and the telegraphic sentence structure found in many taxonomic works. Being unaware of a sublanguage will often lead to incorrect assumptions about use of the language. Metexa is an example of a tool that uses a specialized sublanguage in the radiology domain [
<xref ref-type="bibr" rid="B23">41</xref>
].</p>
</list-item>
</list>
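<p>The pattern matching approach in the first item of the list, and its sensitivity to linguistic variation, can be illustrated with a small sketch. The regular expressions, the function name, and the entity names kinaseA and proteinB below are hypothetical; they cover only the variants listed in that item and do not reproduce any cited system.</p>
<preformat>
```python
import re

# Each entry pairs a compiled pattern with the group order (activator, target).
# The passive pattern is tried first so that "was activated by" is not
# misread by the simpler active pattern.
PATTERNS = [
    (re.compile(r"(\w+) (?:is|was) activated (?:by|via) (\w+)"), (2, 1)),
    (re.compile(r"(\w+), which activates (\w+)"), (1, 2)),
    (re.compile(r"(\w+) (?:activates|activated) (\w+)"), (1, 2)),
]


def extract_activation(sentence):
    """Return (activator, target) if a known pattern matches, else None."""
    for pattern, (activator_group, target_group) in PATTERNS:
        match = pattern.search(sentence)
        if match:
            return match.group(activator_group), match.group(target_group)
    return None


print(extract_activation("kinaseA activates proteinB"))
print(extract_activation("proteinB was activated by kinaseA"))
print(extract_activation("kinaseA is an activator of proteinB"))
```
</preformat>
<p>The first two sentences yield the same structured fact, while the third, which uses a phrasing not anticipated by any pattern, yields nothing. Every new phrasing requires another hand-written pattern, which is the scaling problem described above.</p>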
<p>NLP techniques are often used as a (standard) initial text processing procedure in an IE application. Once a computer has an understanding of the syntactic and/or semantic meaning of the text, other methods, such as manually derived rules or machine learning based methods, are then often used for further information extraction.</p>
</sec>
<sec sec-type="subsection" id="sec2.3">
<title>2.3. Machine Learning</title>
<p>Machine learning has been used in IE applications since the 1990s. It is a process by which a machine (i.e., computer algorithm) improves its performance automatically with experience [
<xref ref-type="bibr" rid="B24">42</xref>
]. Creating extraction rules automatically by machine learning is favored over creating them manually because hand-crafted rules take longer to create and this time accumulates for each new document collection [
<xref ref-type="bibr" rid="B25">43</xref>
]. As a generic method, machine-learning applications may be found in all aspects of an IE system, ranging from learning lexicons for a syntactic parser, classifying and relating potential extraction targets, to fitting extracted entities into an extraction template.</p>
<p>Learning can take various forms including rule sets, decision trees, clustering algorithms, linear models, Bayesian networks, artificial neural networks, and genetic algorithms (which are capable of mimicking chromosome mutations). Some machine-learning algorithms (e.g., most classification algorithms such as decision trees, naïve Bayes, Support Vector Machines) rely on substantial “training” before they can perform a task independently. These algorithms fall in the category of “supervised machine learning.” Some other algorithms (e.g., most clustering algorithms) require little or no training at all, so they belong to the “unsupervised machine learning” category. Due to the considerable cost associated with preparing training examples, one research theme in machine learning is to investigate innovative ways to reduce the number of training examples required by supervised learning algorithms to achieve the desired level of performance. This gave rise to a third category of machine learning algorithms, “semisupervised.” Co-training is one of the learning approaches that fall into this category. Co-training refers to two algorithms that are applied to the same task, but learn about that task in two different ways. For example, an algorithm can learn about the contents of a web site by (1) reading the text of the web site or (2) reading the text of the links to the web site. The two bodies of text are different, but refer to the same thing (i.e., the web site). Two different algorithms can be used to learn about the web site, feed each other machine-made training examples (which reduces the requirements of human-made training examples), and often make each other better. However, co-training requires two independent views of the same learning task and two independent learners. Not all learning tasks fulfill these requirements. One line of research in NLP that uses co-training is word sense disambiguation [
<xref ref-type="bibr" rid="B26">44</xref>
]. We are not aware of the use of this learning approach in biodiversity information extraction. The best learning algorithm for a certain task is determined by the nature of the task and the characteristics of the source data/document collection, so it is not always possible to design an unsupervised or semisupervised algorithm for a learning task (e.g., an unsupervised algorithm to recognize human handwriting may not be possible).</p>
<p>The form of training examples required by a supervised algorithm is determined largely by the learning task and the algorithm used. For example, in Tang and Heidorn [
<xref ref-type="bibr" rid="B13">13</xref>
], the algorithm was to learn (automatically generate) rules to extract leaf properties from plant descriptions. A training example used in their research as well as the manually derived, correct extraction is in
<xref ref-type="fig" rid="figbox1">Box 1</xref>
(examples slightly modified for human readability).</p>
<p>By comparing original text (italics) and the text in bold, the algorithm can derive a set of candidate extraction rules based on context. The algorithm would also decide the order in which the extraction rules are applied according to the rules' reliability as measured with training examples. The more reliable rules would be utilized first. Two extraction rules generated by the Tang and Heidorn [
<xref ref-type="bibr" rid="B13">13</xref>
] algorithm are shown in
<xref ref-type="fig" rid="figbox2">Box 2</xref>
. Rule 1 extracts from the original text any leaf shape term following a term describing leaf blade and followed by a comma (,) as the leafShape (represented by the placeholder $1). Rule 2 extracts any expression consisting of a range and length unit that follows a comma (,) and is followed by another comma (,) and a leaf base term as the bladeDimension.</p>
<p>These rules can then be used to extract information from new sentences not included in the original training example.
<xref ref-type="fig" rid="figbox3">Box 3</xref>
shows how the rules match a new statement, and are applied to extract new leafShape and bladeDimension values.</p>
<p>This example illustrates a case where words are the basic unit of processing and the task is to classify words by using the context where they appear (
<italic>obovate</italic>
is identified as a leaf shape because it follows the phrase “leaf blade”).</p>
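<p>The behavior of the two rules can be approximated with regular expressions. The sketch below is a hand-written stand-in, not the rule syntax learned by the Tang and Heidorn algorithm; the shape vocabulary, the unit list, the use of the word “base” as the leaf base term, and the example sentence are all assumptions made for illustration.</p>
<preformat>
```python
import re

# Hypothetical vocabulary of leaf shape terms.
LEAF_SHAPES = ["obovate", "ovate", "lanceolate", "elliptic"]

# Rule 1: a leaf shape term that follows "leaf blade" and precedes a comma.
SHAPE_RULE = re.compile(
    r"leaf blade\s+(" + "|".join(LEAF_SHAPES) + r")\s*,", re.IGNORECASE
)

# Rule 2: a range plus length unit between two commas, followed by "base".
DIMENSION_RULE = re.compile(
    r",\s*(\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?\s*(?:mm|cm|m))\s*,\s*base\b",
    re.IGNORECASE,
)


def extract_leaf_characters(sentence):
    """Apply both context rules and return the slots they fill."""
    result = {}
    for slot, rule in (("leafShape", SHAPE_RULE), ("bladeDimension", DIMENSION_RULE)):
        match = rule.search(sentence)
        if match:
            result[slot] = match.group(1)
    return result


print(extract_leaf_characters("Leaf blade obovate, 3-5 cm, base cuneate."))
```
</preformat>
<p>For the hypothetical sentence above, the rules fill leafShape with “obovate” and bladeDimension with “3-5 cm”. A sentence that lacks the expected context, such as one that never mentions “leaf blade”, leaves the corresponding slot empty.</p>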
<p>In some applications, for example, named entity recognition (e.g., recognizing a word/phrase as a taxon name), an extraction target may appear in any context (e.g., a taxon name may be mentioned anywhere in a document). In these applications, the contextual information is less helpful in classifying a word/phrase than the letter combinations within the names. In NetiNeti, for example, a Naïve Bayes algorithm (a supervised learning algorithm based on Bayes' theorem of conditional probability) uses letter combinations to identify candidate taxon names [
<xref ref-type="bibr" rid="B27">32</xref>
]. When several training examples indicate names like
<italic>Turdus migratorius </italic>
are taxon names, NetiNeti may learn that a two-word phrase with the first letter capitalized and the last word ending with “
<italic>us</italic>
” (e.g.,
<italic> Felis catus</italic>
) is probably a taxon name, even though
<italic>Felis catus</italic>
has never appeared in training examples.</p>
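<p>The idea of classifying a phrase from its letter combinations can be sketched with a very small Naïve Bayes classifier. This is not NetiNeti's implementation or feature set; the three features, the add-one smoothing, and the six training phrases are assumptions chosen to keep the example short.</p>
<preformat>
```python
import math
from collections import Counter


def features(phrase):
    """Describe a phrase by simple letter-level cues (hypothetical feature set)."""
    words = phrase.split()
    return [
        "first_cap=%s" % words[0][0].isupper(),
        "last_lower=%s" % words[-1].islower(),
        "suffix=" + words[-1][-2:].lower(),
    ]


def train(examples):
    """Count classes and per-class feature occurrences from labeled phrases."""
    class_counts, feat_counts, vocab = Counter(), {}, set()
    for phrase, label in examples:
        class_counts[label] += 1
        counter = feat_counts.setdefault(label, Counter())
        for feat in features(phrase):
            counter[feat] += 1
            vocab.add(feat)
    return class_counts, feat_counts, vocab


def classify(phrase, model):
    """Return the label with the highest smoothed log probability."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        # Add-one smoothing, with one extra slot reserved for unseen features.
        denom = sum(feat_counts[label].values()) + len(vocab) + 1
        for feat in features(phrase):
            score += math.log((feat_counts[label][feat] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


training = [
    ("Turdus migratorius", "name"), ("Canis lupus", "name"),
    ("Passer domesticus", "name"), ("leaf blade", "other"),
    ("The river", "other"), ("large tree", "other"),
]
model = train(training)
print(classify("Felis catus", model))  # prints name
```
</preformat>
<p>Although Felis catus is absent from the toy training data, its capitalized first letter and the ending “us” are enough for the classifier to label it a name. A real system needs far more training data and richer features than this sketch.</p>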
<p>Supervised learning algorithms can be more difficult to use in biology largely because compiling large training datasets can be labor intensive, which decreases the adaptability and scalability of an algorithm to new document collections. Hundreds of controlled vocabularies exist for biological sciences, which can provide some training information to an algorithm but are often not comprehensive [
<xref ref-type="bibr" rid="B28">16</xref>
].</p>
<p>Unsupervised learning algorithms do not use training examples. These algorithms try to find hidden structure in unlabeled data using characteristics of the text itself. Well-known unsupervised learning algorithms include clustering algorithms, dimensionality reduction, and self-organizing maps, to name a few. Cui et al. [
<xref ref-type="bibr" rid="B14">14</xref>
] designed an unsupervised algorithm to identify organ names and organ properties from morphological description sentences. The algorithm took advantage of a recurring pattern in which plural nouns that start a sentence are organs and a descriptive sentence starts with an organ name followed by a series of property descriptors. These characteristics of descriptive sentences allow an unsupervised algorithm to discover organ names and properties.</p>
<p>The procedure may be illustrated by using a set of five descriptive statements taken from Flora of North America (
<xref ref-type="fig" rid="figbox4">Box 4</xref>
).</p>
<p>Because
<italic>roots</italic>
is a plural noun and starts statement 1 (in addition, the words
<italic>rooting</italic>
or
<italic>rooted</italic>
are not seen in the entire document collection, so
<italic>roots</italic>
is unlikely to be a verb), the algorithm infers
<italic>roots</italic>
is an organ name. Then, what follows it (i.e.,
<italic>yellow</italic>
) must be a property. The algorithm remembers
<italic>yellow</italic>
is a property when it encounters statement 2 and it then infers that
<italic>petals</italic>
is an organ. Similarly, when it reads statement 3, because
<italic>petals</italic>
is an organ, the algorithm infers
<italic>absent</italic>
is a property, which enables the algorithm to further infer
<italic>subtending bracts</italic>
and
<italic>abaxial hastular</italic>
in statements 4 and 5 are organs. This example shows that by utilizing the description characteristics, the algorithm is able to learn that
<italic>roots</italic>
,
<italic>petals</italic>
,
<italic>subtending bracts</italic>
, and
<italic>abaxial hastular</italic>
are organ names and
<italic>yellow</italic>
and
<italic>absent</italic>
are properties, without using any training examples, dictionaries, or ontologies.</p>
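<p>The bootstrapping procedure just described can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of Cui et al.: it assumes that every statement has the form “organ phrase followed by a one-word property”, and the five statements are hypothetical stand-ins built from the terms named above, not the actual text of Box 4.</p>
<preformat>
```python
def find_seed(statements):
    """Pick the sentence-initial plural noun of statement 1 as the seed organ,
    unless its verb forms (stem + 'ing' or 'ed') occur in the collection."""
    first = statements[0].lower().split()[0]
    corpus_words = set(" ".join(statements).lower().split())
    stem = first[:-1]
    if first.endswith("s") and stem + "ing" not in corpus_words \
            and stem + "ed" not in corpus_words:
        return first
    return None


def bootstrap(statements, seed_organ):
    """Alternate between inferring properties from organs and organs from properties."""
    organs, properties = {seed_organ}, set()
    changed = True
    while changed:
        changed = False
        for statement in statements:
            words = statement.lower().rstrip(".").split()
            head, tail = " ".join(words[:-1]), words[-1]
            if head in organs and tail not in properties:
                properties.add(tail)   # known organ, so what follows is a property
                changed = True
            elif tail in properties and head not in organs:
                organs.add(head)       # known property, so what precedes is an organ
                changed = True
    return organs, properties


# Hypothetical statements, simplified for illustration.
statements = [
    "roots yellow", "petals yellow", "petals absent",
    "subtending bracts absent", "abaxial hastular absent",
]
organs, properties = bootstrap(statements, find_seed(statements))
print(sorted(organs), sorted(properties))
```
</preformat>
<p>Starting from the single seed “roots”, the loop recovers all four organ names and both properties without any labeled examples, which mirrors the chain of inferences in the text.</p>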
<p>Because not all text possesses the characteristics required by the algorithm developed by Cui et al. [
<xref ref-type="bibr" rid="B14">14</xref>
], it cannot be directly applied to all taxon descriptions. However, because descriptions with those characteristics do exist in large numbers and because of the low overhead (in terms of preparing training examples) of the unsupervised learning algorithm, it is argued that unsupervised learning should be exploited when possible, such as when preparing text for a supervised learning task [
<xref ref-type="bibr" rid="B28">16</xref>
].</p>
</sec>
</sec>
<sec id="sec3">
<title>3. Review of Biodiversity Information Extraction Systems Annotation</title>
<p> Our review describes the features of each system existing at the time of this writing. Many of the systems are being constantly developed with new features and enhanced capabilities. We encourage the readers to keep track of the development of these systems.</p>
<p> Once text has been digitized, it can be annotated in preparation for an IE task or for use as training data for algorithms (
<xref ref-type="fig" rid="fig2">Figure 2</xref>
number 8). The two aims require different levels of annotation granularity, which can be accomplished manually or automatically using annotation software. A low level of granularity (coarse) is helpful for identifying blocks of text useful for IE. As mentioned before, not all text is useful for every IE task. In the practice of systematics, taxonomists need text containing nomenclatural acts, which may be discovered and annotated automatically through terms such as “sp. nov.” and “nov. comb.” Annotation of these text blocks is helpful for algorithms designed to extract information about species. A finer granularity is needed for training data annotation. Words or phrases within a taxonomic work may be annotated as a name, description, location, and so forth. High granularity is more helpful for training a machine-learning algorithm but imposes a larger cost in the time needed to do the manual annotation. The balance between level of granularity and amount of manual investment is determined by the specific goals at hand.</p>
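<p>Coarse annotation driven by marker terms can be sketched as follows. The sketch assumes that text has already been split into paragraphs and looks only for the two markers mentioned above; the label names, the function name, and the placeholder taxon names in the example are hypothetical.</p>
<preformat>
```python
import re

# Marker terms that signal a nomenclatural act (only the two named in the text).
ACT_MARKER = re.compile(r"\b(?:sp\.\s*nov\.|nov\.\s*comb\.)", re.IGNORECASE)


def annotate_blocks(paragraphs):
    """Label each paragraph as a nomenclatural act block or as other text."""
    annotations = []
    for index, text in enumerate(paragraphs):
        match = ACT_MARKER.search(text)
        annotations.append({
            "index": index,
            "label": "nomenclatural_act" if match else "other",
            "marker": match.group(0) if match else None,
        })
    return annotations


# Placeholder paragraphs for illustration only.
paragraphs = [
    "Aus bus sp. nov. is described from material collected in 1990.",
    "Specimens were preserved in ethanol.",
    "Xus yus (Smith) nov. comb. is proposed here.",
]
for annotation in annotate_blocks(paragraphs):
    print(annotation)
```
</preformat>
<p>Only the blocks labeled as nomenclatural acts would be passed on to finer-grained, and more costly, annotation or extraction steps.</p>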
<p>Manual annotation is very time consuming, but several software packages can assist with it.</p>
<p>
<statement id="head1">
<title>taXMLit</title>
<p>taXMLit is an interface for annotating taxonomic literature [
<xref ref-type="bibr" rid="B35">51</xref>
]. It was developed using botanical and zoological text, but also works well on paleontological text. The system is designed for annotation of text elements such as “description” and “locality.” It requires a fairly large amount of human intervention and has not been widely adopted.</p>
</statement>
</p>
<p>
<statement id="head2">
<title>GoldenGATE</title>
<p>GoldenGATE is an annotation tool for marking up taxonomic text in XML according to the taxonX schema (
<ext-link ext-link-type="uri" xlink:href="http://plazi.org/files/GoldenGATE_V2_end_user_manual.pdf">http://plazi.org/files/GoldenGATE_V2_end_user_manual.pdf</ext-link>
, [
<xref ref-type="bibr" rid="B37">53</xref>
]). Most of the annotation is done semi-automatically, with users checking the correctness of the annotations in the GoldenGATE editor, which facilitates manual XML markup. Several plugins are available for GoldenGATE, including modules for annotation specific to ZooTaxa, and new plugins can be added relatively easily. The system is implemented in Java. It performs best with high-quality OCR text marked up with basic HTML tags (such as paragraph and header).</p>
</statement>
</p>
<p>
<statement id="head3">
<title>ABNER</title>
<p>ABNER is an entity recognition algorithm designed specifically for the biomedical literature [
<xref ref-type="bibr" rid="B38">54</xref>
]. It uses a conditional random fields (CRF) model, a type of probabilistic sequence model in which the computer uses characteristics of the text to determine the probability that a given term should be annotated as a given class. In this case, the available classes are protein, DNA, RNA, cell line, and cell type. A human uses a point-and-click interface to confirm the algorithm results and add the annotation.</p>
</statement>
</p>
<p>
<statement id="head4">
<title>OnTheFly</title>
<p>OnTheFly is a text annotator that automatically finds and labels names of proteins, genes, and small molecules in Microsoft Office, PDF, and text documents [
<xref ref-type="bibr" rid="B39">55</xref>
]. A user submits a file through the interface, which converts the file into HTML and sends it to the Reflect tool. This tool looks for names and synonyms of proteins and small molecules and annotates them as such [
<xref ref-type="bibr" rid="B40">56</xref>
]. It uses a list of 5.8 million molecule names from 373 organisms and returns matching terms. Clicking on an annotated term returns a pop-up window with additional information. In addition, this tool can create a graphical representation of the relationships between these entities using the STITCH database [
<xref ref-type="bibr" rid="B41">57</xref>
].</p>
</statement>
</p>
<p>
<statement id="head5">
<title>Phenex</title>
<p>Phenex was designed for use in the phenotypic literature [
<xref ref-type="bibr" rid="B42">58</xref>
]. It is a user interface that aids in manual annotation of biological text using terms from existing ontologies. Phenex allows users to annotate free text or NEXUS files. A core function of this software is to allow users to construct EQ (Entity:Quality) statements representing phenotypes. An EQ statement consists of two parts, a character (entity) and state (quality). The character is described using a term from an anatomy ontology and the state of that character is described using a term from a quality ontology (see, e.g., [
<xref ref-type="bibr" rid="B43">59</xref>
]). An example would be supraorbital bone:sigmoid. The fact that sigmoid is a shape is inferred from the PATO ontology and thus does not have to be specifically mentioned in the EQ statement (within [
<xref ref-type="bibr" rid="B43">59</xref>
] see
<xref ref-type="fig" rid="fig1">Figure 1</xref>
). Users can load the ontology containing the terms they want to use for annotation into Phenex, which has an auto-complete function to facilitate work. The Phenex GUI provides components for editing, searching, and graphical displays of terms. This software is open source, released under the MIT license (
<ext-link ext-link-type="uri" xlink:href="http://phenoscape.org/wiki/Phenex">http://phenoscape.org/wiki/Phenex</ext-link>
).</p>
</statement>
</p>
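<p>The EQ (Entity:Quality) statement described for Phenex can be pictured as a simple two-part record. The sketch below is not Phenex code: the class and function names are hypothetical, and plain text labels stand in for the ontology term identifiers that a real annotation would use.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQStatement:
    """One phenotype assertion: an entity (character) and a quality (state)."""

    entity: str   # label of a term from an anatomy ontology
    quality: str  # label of a term from a quality ontology such as PATO

    def render(self):
        # Same "entity:quality" notation as the example in the text.
        return f"{self.entity}:{self.quality}"


def parse_eq(text):
    """Split an 'entity:quality' string into an EQStatement."""
    entity, separator, quality = text.partition(":")
    if not separator or not entity.strip() or not quality.strip():
        raise ValueError(f"not an EQ statement: {text!r}")
    return EQStatement(entity.strip(), quality.strip())
```

<p>For the example in the text, parse_eq("supraorbital bone:sigmoid") yields an entity of “supraorbital bone” and a quality of “sigmoid”; the fact that sigmoid is a shape would come from the quality ontology rather than from the record itself.</p>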
<sec sec-type="subsection" id="sec3.1">
<title>3.1. Digitization</title>
<p> The first step to making older biological literature machine readable is digitization (number 4 in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
). Book pages can be scanned as images of text and made into PDF files, but cannot be submitted to NLP in this form. To make the text accessible, the images must undergo Optical Character Recognition (OCR), which translates an image of text (such as a .pdf file) into actual text (such as a .txt file). The Biodiversity Heritage Library is in the process of digitizing 600,000 pages of legacy text a month, making them available as PDF image files and OCR text files [
<xref ref-type="bibr" rid="B29">45</xref>
]. Most modern publications are available as PDF and HTML files from the publisher (and thus do not need to be scanned or OCRed). Images of text can be run through OCR software on desktop computers or through a web service (e.g.,
<ext-link ext-link-type="uri" xlink:href="http://www.onlineocr.net/">http://www.onlineocr.net/</ext-link>
). OCR of handwriting is very different from that of printed text and can be quite difficult, as there are as many handwriting styles as there are people. However, this type of OCR can be very important because significant portions of biodiversity data are only available as handwriting, such as museum specimen labels and laboratory notebooks. Algorithms do exist and are used for OCR of handwritten cities, states, and zip codes on envelopes and of handwritten checks [
<xref ref-type="bibr" rid="B30">46</xref>
,
<xref ref-type="bibr" rid="B31">47</xref>
].</p>
<p> OCR is not a perfect technology. It is estimated that >35% of taxon names in BHL OCR files contain an error [
<xref ref-type="bibr" rid="B29">45</xref>
,
<xref ref-type="bibr" rid="B32">48</xref>
,
<xref ref-type="bibr" rid="B33">49</xref>
]. This is skewed, however, as older documents that use nonstandard fonts carry the majority of the errors [
<xref ref-type="bibr" rid="B33">49</xref>
]. Biodiversity literature can be especially difficult to OCR because it often has multiple languages on the same page (such as Latin descriptions), an expansive historical record going back to the 15th century (with print quality and consistency issues), and irregular typefaces or typesetting [
<xref ref-type="bibr" rid="B32">48</xref>
]. OCR is poor at distinguishing indentation patterns, bold, and italicized text, which can be important in biodiversity literature [
<xref ref-type="bibr" rid="B34">50</xref>
,
<xref ref-type="bibr" rid="B35">51</xref>
]. The current rate of digitization makes manual correction of these errors impractical. Proposed solutions include crowd-sourced manual correction and machine learning for automated correction [
<xref ref-type="bibr" rid="B32">48</xref>
].</p>
<p>OCR errors may be overcome by using “fuzzy” matching algorithms that can recognize the correct term from the misspelled version. TAXAMATCH is a fuzzy matching algorithm for use in taxonomy [
<xref ref-type="bibr" rid="B36">52</xref>
]. Fuzzy matching is needed to detect similar names for functions such as search, federation of content, and correction of misspellings or OCR errors. TAXAMATCH uses phonetic and nonphonetic near-match algorithms that calculate the distance between a given letter combination and a target name in a reference database [
<xref ref-type="bibr" rid="B36">52</xref>
]. A letter combination that is close to a target name is proposed as a fuzzy match. This system is being successfully used to increase hits in species databases [
<xref ref-type="bibr" rid="B36">52</xref>
] and is optimized for human typos rather than OCR errors. The PHP version of this code is available through Google Code (
<ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/taxamatch-webservice/">http://code.google.com/p/taxamatch-webservice/</ext-link>
) and a Ruby version is available through GitHub (
<ext-link ext-link-type="uri" xlink:href="https://github.com/GlobalNamesArchitecture/taxamatch_rb">https://github.com/GlobalNamesArchitecture/taxamatch_rb</ext-link>
).</p>
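<p>The idea of proposing the nearest reference name as a fuzzy match can be sketched with a plain edit distance. This is not the TAXAMATCH algorithm, which also applies phonetic rules; the function names, the distance threshold of 2, and the case-insensitive comparison are assumptions made for this example.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(candidate, reference_names, max_distance=2):
    """Return the closest reference name, or None if none is close enough."""
    best_name, best_distance = None, None
    for name in reference_names:
        distance = edit_distance(candidate.lower(), name.lower())
        if best_distance is None or best_distance > distance:
            best_name, best_distance = name, distance
    if best_distance is None or best_distance > max_distance:
        return None
    return best_name
```

<p>With a reference list containing “Homo sapiens”, the misspelling “Homo sapeins” is two edits away and is therefore matched, whereas a string that is far from every reference name returns no match.</p>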
</sec>
<sec sec-type="subsection" id="sec3.2">
<title>3.2. Names Recognition and Discovery</title>
<p> A taxonomic name is connected to almost every piece of information about an organism, making names near universal metadata in biology (see Rod Page's iphylo blog entry
<ext-link ext-link-type="uri" xlink:href="http://iphylo.blogspot.com/2011/04/dark-taxa-genbank-in-post-taxonomic.html">http://iphylo.blogspot.com/2011/04/dark-taxa-genbank-in-post-taxonomic.html</ext-link>
for an exception). This can be exploited to find and manage nearly all biological data. No life-wide, comprehensive list of taxonomic names exists, but the Global Names Index (GNI) holds 20 million names and NameBank (
<ext-link ext-link-type="uri" xlink:href="http://www.ubio.org/index.php?pagename=namebank">http://www.ubio.org/index.php?pagename=namebank</ext-link>
) holds 10 million names. There are also exclusive lists of taxonomically credible names such as the Catalogue of Life (CoLP) and the Interim Register of Marine and Non-marine Genera (IRMNG). These lists hold 1.3 million and 1.6 million names, respectively.</p>
<p>Taxonomic names discovery (or Named Entity Recognition in computer science parlance) can be achieved through several approaches.
<italic>Dictionary-based approaches </italic>
rely on an existing list of names. These systems try to find names on the list directly in the text. The major drawback of this approach in biology is that there is no comprehensive list of names and terms including all misspellings, variants, and abbreviations. Dictionary-based approaches can also miss synonyms and ambiguous names. Some algorithms have been developed to aid dictionary-based approaches with recognizing variants of names in the list (e.g., see algorithms described below).
<italic>Rule-based approaches</italic>
work by applying a fixed set of rules to a text. This approach is capable of dealing with variations in word order and sentence structure in addition to word morphology. The major drawback is that the rule sets are handmade (and, therefore, labor intensive) and are rarely applicable to multiple domains.
<italic>Machine</italic>
-
<italic>learning approaches</italic>
use rule sets generated by the machine using statistical procedures (such as Hidden Markov Models). In this approach, algorithms are trained on an annotated body of text in which names are tagged by hand. The algorithms can be applied to text in any discipline as long as appropriate training data are available. All of these approaches have strengths and weaknesses, so they are often combined in final products.</p>
<p>Several algorithms have been developed that are capable of identifying and discovering known and unknown (to the algorithm) taxon names in free text. These are discussed below and their performance metrics are given in
<xref ref-type="table" rid="tab4">Table 4</xref>
.</p>
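<p>The performance metrics reported for name-finding tools are usually precision and recall. A minimal sketch of how they can be computed from a tool's output and a hand-made list of expected names is given below; the function name and the choice to return 0.0 for an empty denominator are assumptions.</p>

```python
def precision_recall(found, expected):
    """Compare the names a tool found with the names a human expected.

    precision = correct names found / all names found
    recall    = correct names found / all names expected
    """
    found, expected = set(found), set(expected)
    true_positives = len(found.intersection(expected))
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall
```

<p>A tool that returns few false positives but misses many names has high precision and low recall; a tool that flags almost every candidate has the opposite profile.</p>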
<p>
<statement id="head6">
<title>TaxonGrab</title>
<p>TaxonGrab identifies names by using a combination of nomenclatural rules and a list (dictionary) of non-taxonomic English terms [
<xref ref-type="bibr" rid="B44">31</xref>
]. As most taxonomic names do not match words in common parlance, the dictionary can be used as a “black list” to exclude terms. This assumption does not always hold because some Latin names match vernacular names, such as bison and
<italic>Bison bison</italic>
. The algorithm scans text for terms that are not found in the black list. It treats these as candidate names. These terms are then compared to the capitalization rules of Linnaean nomenclature. Algorithms of this type have low precision because misspelled words, non-English words, and medical or legal terms would be flagged as candidate names. However, these terms can be iteratively added to the black list, improving future precision. This method has the advantage of not requiring a complete list of species names, but it can only be used on English texts. Later, several additional rules were added to create a new product, FAT [
<xref ref-type="bibr" rid="B45">60</xref>
]. FAT employs “fuzzy” matching and structural rules sequentially so that each rule can use the results of the last. The TaxonGrab code is available at SourceForge, but the FAT code is not. FAT is a part of the plazi.org toolset for markup of taxonomic text.</p>
</statement>
</p>
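<p>The black-list approach can be sketched as follows. This is not the TaxonGrab code: the tiny word list stands in for a full English dictionary, the function name is hypothetical, and only binomials (a capitalized word followed by a lowercase word) are considered.</p>

```python
import re

# Stand-in for a dictionary of non-taxonomic English words (assumption).
ENGLISH_BLACK_LIST = {"the", "was", "found", "near", "a", "river", "and", "in", "forest"}

WORD = re.compile(r"[A-Za-z]+")


def candidate_binomials(text, black_list=ENGLISH_BLACK_LIST):
    """Return 'Genus species' pairs whose words are absent from the black list."""
    words = WORD.findall(text)
    names = []
    for first, second in zip(words, words[1:]):
        # Words found in the black list cannot be part of a candidate name.
        if first.lower() in black_list or second.lower() in black_list:
            continue
        # Capitalization rule: capitalized genus, then a lowercase epithet.
        if first[0].isupper() and first[1:].islower() and second.islower():
            names.append(f"{first} {second}")
    return names
```

<p>The sketch also reproduces the weakness noted in the text: if “bison” is in the black list, the valid name “Bison bison” is excluded.</p>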
<p>
<statement id="head7">
<title>TaxonFinder</title>
<p>TaxonFinder identifies scientific names in free text by comparing the name to several lists embedded into the source code ([
<xref ref-type="bibr" rid="B46">61</xref>
], Leary, personal communication). These lists are derived from a manually curated version of NameBank (
<ext-link ext-link-type="uri" xlink:href="http://www.ubio.org/index.php?pagename=namebank">http://www.ubio.org/index.php?pagename=namebank</ext-link>
). A list of ambiguous names was compiled from words that are names, but are more often used in common parlance, like pluto or tumor. TaxonFinder breaks documents into words and compares them to the lists individually. When it encounters a capitalized word, it checks the “genus” and “above-genus” name lists. If the word is in the above-genus list, but not in the ambiguous name list, it is returned as a name. If it is in the genus list, the next word is checked to see if it is in lower case or all caps and to see if it is in the “species-or-below” name list. If it is, then the process is repeated with the next word until a complete polynomial is returned. If the next word is not in the list, then the previous name is returned as a genus. TaxonFinder is limited to dictionaries and thus will not find new names or misspellings, but it can discover new combinations of known names. This system can achieve both high precision and high recall, with precision the higher of the two (more false negatives than false positives). A previous version of TaxonFinder, FindIT (
<ext-link ext-link-type="uri" xlink:href="http://www.ubio.org/tools/recognize.php">http://www.ubio.org/tools/recognize.php</ext-link>
), had the ability to identify authorship by recognizing the reference (usually a taxonomist's name), which TaxonFinder does not do (
<ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/taxon-name-processing/wiki/nameRecognition">http://code.google.com/p/taxon-name-processing/wiki/nameRecognition</ext-link>
). A new Apache Lucene-based name indexer, itself based on TaxonFinder, is now available from GBIF (
<ext-link ext-link-type="uri" xlink:href="http://tools.gbif.org/namefinder/">http://tools.gbif.org/namefinder/</ext-link>
). The source code for TaxonFinder is available at Google Code (
<ext-link ext-link-type="uri" xlink:href="http://code.google.com/p/taxon-finder/">http://code.google.com/p/taxon-finder/</ext-link>
).</p>
</statement>
</p>
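<p>The word-by-word lookup described for TaxonFinder can be sketched as below. This is not the TaxonFinder source: the four lists are toy examples rather than NameBank extracts, the function name is hypothetical, and punctuation handling is deliberately crude. “Pluto” is placed in the above-genus list only to exercise the ambiguous-name check; no claim about its rank is intended.</p>

```python
# Toy lists (assumptions); the real tool embeds lists derived from NameBank.
GENUS = {"Quercus", "Homo"}
ABOVE_GENUS = {"Fagaceae", "Pluto"}
SPECIES_OR_BELOW = {"alba", "sapiens", "neanderthalensis"}
AMBIGUOUS = {"Pluto"}


def find_names(text):
    """Walk through the words of a text and return the names found."""
    words = text.replace(",", " ").replace(".", " ").split()
    names = []
    i = 0
    while len(words) > i:
        word = words[i]
        if not word[0].isupper():
            i += 1
            continue
        if word in GENUS:
            parts = [word]
            j = i + 1
            # Extend the name while the next word is lower case or all caps
            # and appears in the species-or-below list.
            while (len(words) > j
                   and (words[j].islower() or words[j].isupper())
                   and words[j].lower() in SPECIES_OR_BELOW):
                parts.append(words[j].lower())
                j += 1
            # With no matching epithet, the genus alone is returned.
            names.append(" ".join(parts))
            i = j
            continue
        if word in ABOVE_GENUS and word not in AMBIGUOUS:
            names.append(word)
        i += 1
    return names
```

<p>Because every decision is a list lookup, the sketch shares the limitation stated in the text: a name that is absent from the lists, or misspelled, is never returned.</p>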
<p>
<statement id="head8">
<title>NetiNeti</title>
<p>NetiNeti takes a more unsupervised approach to names extraction [
<xref ref-type="bibr" rid="B27">32</xref>
]. The system uses natural language processing techniques involving probabilistic classifiers (Naive Bayes classifier by default) to recognize scientific names in an arbitrary document. The classifier is trained to recognize characteristics of scientific names as well as the context. The algorithm uses “white list” and “black list” detection techniques in a secondary role. As a result, scientific names not mentioned in a white list, or names with OCR errors or misspellings, are found with great accuracy. Limitations of NetiNeti include an inability to identify genus names fewer than four letters long, the assumption that genera are abbreviated to one letter, and the restriction of contextual information to one word on either side of a candidate name. The code of this tool is written in Python and is going to be released under the GPL2 license at
<ext-link ext-link-type="uri" xlink:href="https://github.com/mbl-cli/NetiNeti">https://github.com/mbl-cli/NetiNeti</ext-link>
.</p>
</statement>
</p>
<p>
<statement id="head9">
<title>Linnaeus</title>
<p>LINNAEUS is a list-based system designed specifically for identifying taxonomic names in biomedical literature and linking those names to database identifiers [
<xref ref-type="bibr" rid="B47">33</xref>
]. The system recognizes names contained in a white list (based on the NCBI classification and a custom set of synonyms) and resolves them to an unambiguous NCBI taxonomy identifier within the NCBI taxonomy database (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK21100/">http://www.ncbi.nlm.nih.gov/books/NBK21100/</ext-link>
). In this way, multiple names for one species are normalized to a single identifier. This system is capable of recognizing and normalizing ambiguous mentions, such as abbreviations (
<italic>C. elegans</italic>
, which refers to 41 species) and acronyms (CMV, which refers to 2 species). Acronyms that are not listed within the NCBI classification are discovered using the Acromine service [
<xref ref-type="bibr" rid="B48">62</xref>
] and a novel acronym detector built into LINNAEUS that can detect acronym definitions within text (in the form of “species (acronym)”). Ambiguous mentions that are not resolvable are assigned a probability of how likely the mention refers to a species based on the relative frequency of nonambiguous mentions across all of MEDLINE. Applying a black list of species names that occur commonly in the English language when not referring to species (such as the common name spot) greatly reduces false positives. LINNAEUS can process files in XML and txt formats and give output in tab-separated files, XML, HTML and MySQL database tables. This code is available at SourceForge (
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/linnaeus/">http://sourceforge.net/projects/linnaeus/</ext-link>
).</p>
</statement>
</p>
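<p>Detection of acronym definitions of the form “species (acronym)” can be illustrated with a simple heuristic. The source does not describe how the LINNAEUS detector works internally, so the rule used here (the initials of the words before the parenthesis must spell the acronym) and the function name are assumptions.</p>

```python
import re

# An acronym in parentheses: two or more capital letters.
ACRONYM = re.compile(r"\(([A-Z]{2,})\)")


def find_acronym_definitions(text):
    """Map each acronym to the words before it whose initials spell it."""
    definitions = {}
    for match in ACRONYM.finditer(text):
        acronym = match.group(1)
        preceding = text[:match.start()].split()
        # Take as many preceding words as the acronym has letters.
        candidate = preceding[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            definitions[acronym] = " ".join(candidate)
    return definitions
```

<p>The heuristic recognizes “human immunodeficiency virus (HIV)” but not a single-word expansion such as “cytomegalovirus (CMV)”, which is one reason a production system would combine it with a curated list or an external service.</p>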
<p>
<statement id="head10">
<title>OrganismTagger</title>
<p>This system uses the NCBI taxonomy database (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK21100/">http://www.ncbi.nlm.nih.gov/books/NBK21100/</ext-link>
) to generate semantically enabled lists and ontology components for organism name extraction from free text [
<xref ref-type="bibr" rid="B49">34</xref>
]. These components are connected to form a work flow pipeline using GATE (the General Architecture for Text Engineering; [
<xref ref-type="bibr" rid="B50">63</xref>
,
<xref ref-type="bibr" rid="B51">64</xref>
]). These components are a combination of rule-based and machine-learning approaches to discover and extract names from text, including strain designations. To identify strains not in the NCBI taxonomy database, OrganismTagger uses a “strain classifier,” a machine-learning (SVM model) approach trained on manually annotated documents. After the strain classifier is applied, organism names are first detected, then normalized to a single canonical name and grounded to a specific NCBI database ID. The semantic nature of this tool allows it to output data in many different formats (XML, OWL, etc.). This code along with supporting materials is available under an open source license at
<ext-link ext-link-type="uri" xlink:href="http://www.semanticsoftware.info/organism-tagger">http://www.semanticsoftware.info/organism-tagger</ext-link>
.</p>
</statement>
</p>
</sec>
<sec sec-type="subsection" id="sec3.3">
<title>3.3. Morphological Character Extraction</title>
<p>Morphological characters of organisms are of interest to systematists, evolutionary biologists, ecologists, and the general public. The examples used in
<xref ref-type="fig" rid="fig4">Figure 4</xref>
are typical of morphological descriptions. The language used in biodiversity science has the following characteristics that make it difficult for general-purpose parsers to process [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B52">65</xref>
,
<xref ref-type="bibr" rid="B53">66</xref>
].</p>
<list list-type="order">
<list-item>
<p>Specialized Language. Most scientific terms are not in the lexicons of existing parsers. Even if they were, biological terms are more ambiguous than general English [
<xref ref-type="bibr" rid="B54">67</xref>
]. In general English, 0.57% of terms are ambiguous, whereas 14.2% of gene names are. At the genus level, 15% of taxonomic names are homonyms (
<ext-link ext-link-type="uri" xlink:href="http://www.obis.org.au/irmng/irmng_faq/">http://www.obis.org.au/irmng/irmng_faq/</ext-link>
). Life science literature also relies heavily on abbreviations [
<xref ref-type="bibr" rid="B55">68</xref>
]. There were over 64,000 new abbreviations introduced in 2004 in the biomedical literature alone and an average of one new abbreviation every 5–10 abstracts [
<xref ref-type="bibr" rid="B56">69</xref>
]. Dictionaries, such as the Dictionary of Medical Acronyms and Abbreviations, can help, but most dictionaries contain 4,000 to 32,000 terms, which is only a fraction of the estimated 800,000 believed to exist [
<xref ref-type="bibr" rid="B56">69</xref>
,
<xref ref-type="bibr" rid="B57">70</xref>
]. This means that dictionary-based approaches will not scale to work in biology.</p>
</list-item>
<list-item>
<p>Diversity. Descriptions are very diverse across taxon groups. Even in one group, for example, plants, variations are large. Lydon et al. [
<xref ref-type="bibr" rid="B58">71</xref>
] compared and contrasted the descriptions of five common species in six different English language Floras and found the same information in all sources only 9% of the time. They also noted differences in terminology usage across Floras.</p>
</list-item>
<list-item>
<p>Syntax differences. Many species descriptions are written in a telegraphic sublanguage (which lacks verbs), but many others conform to more standard English syntax. Parsers expecting standard English syntax often mistake other groups of words for verbs when parsing telegraphic sublanguage because they expect to see verbs in a sentence. Syntax is typically not standardized across different taxon groups or even within the same group.</p>
</list-item>
</list>
<p>Taylor [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B59">72</xref>
] manually constructed a grammar and a lexicon of 2000 characters and character states (1500 from Radford [
<xref ref-type="bibr" rid="B60">73</xref>
] and 500 from descriptive text) to parse the Flora of New South Wales (4 volumes) and volume 19 of the Flora of Australia. The goal of parsing these Floras was to create sets of organism part, character, and character state from each description. These statements can be extracted from morphological taxon descriptions using the hand-crafted parser to get a machine-readable set of facts about organism characteristics. While the sublanguage nature of the plant descriptions used by Taylor [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B59">72</xref>
] made it easier to construct the grammar and lexicon manually, the author acknowledged the limited coverage they could be expected to achieve (60–80% recall was estimated based on manual examination of output). Algorithms for machine-aided expansion of the lexicon were suggested; however, at the time automated creation of rules was believed to be too difficult.</p>
<p>Since Taylor [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B59">72</xref>
], a variety of methods have been used to extract morphological traits from morphological descriptions. Their performance metrics are given in
<xref ref-type="table" rid="tab4">Table 4</xref>
.</p>
<p>
<statement id="head11">
<title>X-Tract</title>
<p>X-tract [
<xref ref-type="bibr" rid="B61">35</xref>
] was an interactive tool to extract morphological information from Flora of North America (FNA) descriptions, which are available in print and HTML versions. X-tract used HTML tags embedded in the FNA pages to identify the morphological description sections. It used a glossary to classify each word in a description as a structure (i.e., an organ or part of an organ) or a character state. If a word was a character state, its corresponding characters were looked up in the glossary. Then, X-tract created a form to display the structures, substructures, characters, and character states extracted from a document for a user to review, modify, and save to a database. Evaluation of the extraction accuracy or the extent of user intervention was not provided.</p>
</statement>
</p>
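<p>The glossary lookup described for X-tract can be sketched as a dictionary from each word to its class and, for character states, the character it belongs to. The glossary entries, the character names, and the function name below are hypothetical; the original tool's glossary is not reproduced in the source.</p>

```python
# Hypothetical glossary: word -> (class, character for character states).
GLOSSARY = {
    "leaf": ("structure", None),
    "stem": ("structure", None),
    "obovate": ("character state", "shape"),
    "yellow": ("character state", "color"),
    "solitary": ("character state", "arrangement"),
}


def classify_words(description, glossary=GLOSSARY):
    """Label each word of a description using the glossary.

    Returns (word, class, character) tuples; words missing from the
    glossary are labelled 'unknown' and left for a human to review.
    """
    rows = []
    cleaned = description.lower().replace(",", " ").replace(";", " ")
    for word in cleaned.split():
        kind, character = glossary.get(word, ("unknown", None))
        rows.append((word, kind, character))
    return rows
```

<p>The resulting rows correspond to what an interactive form could display for review before anything is saved to a database.</p>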
<p>
<statement id="head12">
<title>Worldwide Botanical Knowledge Base</title>
<p>Jean-Marc Vanel initiated a project called Worldwide Botanical Knowledge Base, which also takes the approach of parsing plus glossary/lexicon. It marks up morphological descriptions at sentence level (e.g., leaf blade obovate is marked as “leaf blade”) without extracting detailed character information. It stores extracted information in XML files instead of in a relational database, as was done by Taylor [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B59">72</xref>
] and Abascal and Sánchez [
<xref ref-type="bibr" rid="B61">35</xref>
]. The project aims to support queries on species descriptions in botanical databases. The database search seems to have stopped working (
<ext-link ext-link-type="uri" xlink:href="http://jmvanel.free.fr/protea.html">http://jmvanel.free.fr/protea.html</ext-link>
). The parser was reported to work on Flora of China and it can be downloaded from the website (
<ext-link ext-link-type="uri" xlink:href="http://wwbota.free.fr/">http://wwbota.free.fr/</ext-link>
). However, as of the time of this publication, the authors were unable to use the parser.</p>
</statement>
</p>
<p>
<statement id="head13">
<title>Terminator</title>
<p>Diederich, Fortuner and Milton [
<xref ref-type="bibr" rid="B62">74</xref>
] developed a system called Terminator, which used a hand-crafted plant glossary that amounts to an ontology including structure names, characters and character states to support character extraction. The extraction process was a combination of fuzzy keyword match and heuristic extraction rules. Because Terminator was an interactive system (i.e., a human operator selects correct extractions), the evaluation was done on 16 descriptions to report the time taken to process them. Extraction performance was evaluated only on 1 random sample: for non-numerical characters, 55% of the time a perfect structure/character/value combination was among the first 5 candidates suggested by the system.</p>
</statement>
</p>
<p>
<statement id="head14">
<title>MultiFlora</title>
<p>Similar to previous works, Wood, Lydon, and colleagues' MultiFlora project (
<ext-link ext-link-type="uri" xlink:href="http://intranet.cs.man.ac.uk/ai/public/MultiFlora/MF1.html">http://intranet.cs.man.ac.uk/ai/public/MultiFlora/MF1.html</ext-link>
) started with manual analysis of description documents. They created an ontology manually, which included classes of organs (e.g., petal) and features (e.g., yellow) linked by properties (e.g., hasColor). They also manually created a gazetteer, which included terms referring to the organs and features that served as a lookup list. The prototype MultiFlora system used a combination of keyword matching, internal and contextual pattern matching, and shallow parsing techniques provided by GATE to extract organ and feature information from a small collection of morphological descriptions (18 species descriptions; recall and precision were in the range of mid 60% to mid 70%; [
<xref ref-type="bibr" rid="B53">66</xref>
,
<xref ref-type="bibr" rid="B63">75</xref>
]). While the work of Wood, Lydon, and colleagues shows that descriptions from different sources can be used to improve recall, the authors acknowledged that organs not included in the manually created gazetteer/ontology have to be marked as “unknown.” The extraction results were output as RDF triples and used to build a knowledge base about plants, which is not related to the Worldwide Botanical Knowledge Base reviewed earlier. RDF (Resource Description Framework) is a data model that allows a user to make machine-readable assertions in the form of triples. The EQ format mentioned earlier is a similar format used in biology. The advantage of using ontology-supported RDF/EQ is that multiple data providers can use the same ontological identifier for the same term. In this way, statements become machine-readable and can be linked regardless of the source. With ontological support, machine-based logic reasoning becomes possible. An immediate application of this type of reasoning and a pool of RDF triples describing species morphology is a specimen identification key. RDF is supported by more recent biodiversity IE systems as an output format.</p>
</statement>
</p>
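<p>The way shared identifiers let statements from different providers be linked can be illustrated with subject, predicate, and object triples. The sketch below uses plain strings in place of the IRIs that real RDF requires and does not use an RDF library; the function names and the “hasShape” property are hypothetical, while petal, yellow, and hasColor come from the text.</p>

```python
from collections import namedtuple

# A triple asserts: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])


def merge_sources(*sources):
    """Combine triples from several providers, dropping exact duplicates."""
    merged = []
    seen = set()
    for triples in sources:
        for triple in triples:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


def query(triples, subject=None, predicate=None):
    """Return the triples matching the given subject and/or predicate."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
    ]
```

<p>Because both providers use the same term for the same thing, a statement such as Triple("petal", "hasColor", "yellow") appearing in two sources is stored once, and a query on “petal” returns the combined facts regardless of where they came from.</p>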
<p>
<statement id="head15">
<title>MARTT</title>
<p>MARTT [
<xref ref-type="bibr" rid="B64">76</xref>
] is an automated description markup system employing a supervised machine-learning algorithm. The system marks up a description sentence-by-sentence with tags that indicate the subject, for example, “stem” is tagged in the text statement “stem solitary.” MARTT along with a test collection is downloadable from
<ext-link ext-link-type="uri" xlink:href="http://sites.google.com/site/biosemanticsproject/project-progress-wiki">http://sites.google.com/site/biosemanticsproject/project-progress-wiki</ext-link>
. Wei [
<xref ref-type="bibr" rid="B65">77</xref>
] conducted an exploratory study of the application of information fusion techniques to taxonomic descriptions. It confirmed the finding of Wood et al. [
<xref ref-type="bibr" rid="B63">75</xref>
] that combining multiple descriptions of the same species from different sources and different taxonomic ranks can provide the researchers more complete information than any single description. Wei used MARTT [
<xref ref-type="bibr" rid="B64">76</xref>
] and a set of heuristic rules to extract character information from descriptions of taxa published in both FNA and Flora of China (FoC) and categorized the extracted information between the two sources as either identical, equivalent, subsumption, complementary, overlap, or conflict. Non-conflict information from both sources was then merged together. The evaluation was conducted involving 13 human curators verifying results generated from 153 leaf descriptions. The results show that the precisions for genus level fusion, species level fusion, FNA genus-species fusion, and FoC genus-species fusion were 77%, 63%, 66%, and 71%, respectively. The research also identified the key factors that contribute to the performance of the system: the quality of the dictionary (or the domain knowledge), the variance of the vocabulary, and the quality of prior IE steps.</p>
</statement>
</p>
<p>
<statement id="head16">
<title>WHISK</title>
<p>Tang and Heidorn [
<xref ref-type="bibr" rid="B13">13</xref>
] adapted WHISK [
<xref ref-type="bibr" rid="B66">78</xref>
] to extract morphological character and other information from several volumes of FNA to show that IE helps the information retrieval system SEARFA (e.g., retrieval of relevant documents). The “pattern matching” learning method used by WHISK is described in
<xref ref-type="sec" rid="sec2">Section 2</xref>
. The pattern matching algorithm was assisted by a knowledge base created by manually collecting structure and character terms from training examples. The IE system was evaluated on a relatively small subset of FNA documents, with each template slot evaluated separately (see
<xref ref-type="table" rid="tab1">Table 1</xref>
for examples of template slots). Different numbers of training and test examples were used for different slots (training examples ranged from 7 to 206, test examples from 6 to 192), and the performance scores were obtained from a single run (as opposed to the typical protocol for supervised learning algorithms). The system performed perfectly on nonmorphological character slots (Genus, Species, and Distribution). On morphological character slots (Leaf shape, Leaf margin, Leaf apex, Leaf base, Leaf arrangement, Blade dimension, Leaf color, and Fruit/nut shape), recall ranged from 33.33% to 79.65% and precision from 75.52% to 100%. An investigation of human user performance on plant identification using internet-based information retrieval systems showed that, even with imperfect extraction performance, users were able to make significantly more identifications using the information retrieval system supported by the extracted character information than using a keyword-based full-text search system.</p>
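<p>Per-slot precision and recall of the kind reported above can be computed as in the following Python sketch. The slot names, documents, and values are made-up illustration data, not results from the study, and the function name precision_recall is hypothetical.</p>
<preformat>
```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for one template slot.

    extracted and gold are sets of (document_id, value) pairs.
    """
    true_positives = len(extracted.intersection(gold))
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up data for illustration only; not results from the reviewed study.
gold = {
    "leaf_shape": {("doc1", "ovate"), ("doc2", "lanceolate"), ("doc3", "linear")},
    "genus": {("doc1", "Quercus"), ("doc2", "Acer")},
}
extracted = {
    "leaf_shape": {("doc1", "ovate"), ("doc2", "elliptic")},
    "genus": {("doc1", "Quercus"), ("doc2", "Acer")},
}

# Each slot is scored separately, mirroring the per-slot evaluation.
scores = {slot: precision_recall(extracted[slot], gold[slot]) for slot in gold}
```
</preformat>
<p>Scoring each slot on its own makes it visible when a system is reliable for simple slots such as Genus but misses many values for a morphological slot such as Leaf shape.</p>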
</statement>
</p>
<p>
<statement id="head17">
<title>CharaParser</title>
<p>All IE systems reviewed above relied on manually created vocabulary resources, whether they are called lexicons, gazetteers, or knowledge bases. Vocabularies are a fundamental resource on which more advanced syntactic and semantic analyses are built. While manually collecting terms for a proof-of-concept system is feasible, the manual approach cannot be scaled to the problem of extracting morphological traits of all taxa. Cui, Selden, and Boufford [
<xref ref-type="bibr" rid="B14">14</xref>
] proposed an unsupervised bootstrapping-based algorithm (described in
<xref ref-type="sec" rid="sec2">Section 2</xref>
) that can extract 93% of anatomical terms and over 50% of character terms from text descriptions without any training examples. This efficient tool may be used to build the vocabulary resources that are required to use various IE systems on new document collections.</p>
<p>This unsupervised algorithm has been used in two IE systems [
<xref ref-type="bibr" rid="B68">36</xref>
,
<xref ref-type="bibr" rid="B67">79</xref>
]. One of the systems used intuitive heuristic rules to associate extracted character information with appropriate anatomical structures. The other system (called CharaParser) adapted a general-purpose syntactic parser (Stanford Parser) to guide the extraction. In addition to structures and character extraction, both systems extract constraints, modifiers, and relations among anatomical structures (e.g., head
<italic>subtended by</italic>
distal leaves; pappi
<italic>consist of</italic>
bristles) as stated in a description. Both systems were tested on two sets of descriptions from volume 19 of FNA and Part H of Treatise on Invertebrate Paleontology (TIP); each set consisted of over 400 descriptions. The heuristic rule-based system achieved precision/recall of 63%/60% on the FNA evaluation set and 52%/43% on the TIP evaluation set on character extraction. CharaParser performed significantly better and achieved precision/recall of 91%/90% on the FNA set and 80%/87% on the TIP set. Similar to Wood et al. [
<xref ref-type="bibr" rid="B53">66</xref>
], Cui and team found the information structure of morphological descriptions was too complicated to be represented in a typical IE template (such as
<xref ref-type="table" rid="tab1">Table 1</xref>
). Wood et al. [
<xref ref-type="bibr" rid="B53">66</xref>
] designed an ontology to hold the extracted information, while Cui and team used XML to store extracted information (
<xref ref-type="fig" rid="fig5">Figure 5</xref>
). CharaParser is expected to be released as open-source software in Fall 2012. Interested readers may contact the team to obtain a trial version before its release.</p>
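<p>The following Python sketch suggests why a nested format such as XML suits this kind of output better than a flat template: structures carry their own characters, and relations link structures to one another. The element and attribute names, the identifiers s1 and s2, and the character value are hypothetical and do not reproduce the schema used by CharaParser; only the relation between pappi and bristles is taken from the example above.</p>
<preformat>
```python
import xml.etree.ElementTree as ET


def build_statement(structures, relations):
    """Serialize structures, their characters, and relations to XML.

    Element and attribute names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("statement")
    for struct_id, (name, characters) in structures.items():
        node = ET.SubElement(root, "structure", id=struct_id, name=name)
        for char_name, value in characters.items():
            ET.SubElement(node, "character", name=char_name, value=value)
    for rel_name, source, target in relations:
        ET.SubElement(root, "relation", name=rel_name, source=source, target=target)
    return root


# "pappi consist of bristles" comes from the text; the count is made up.
structures = {
    "s1": ("pappi", {}),
    "s2": ("bristles", {"count": "many"}),
}
relations = [("consist of", "s1", "s2")]

root = build_statement(structures, relations)
xml_text = ET.tostring(root, encoding="unicode")
```
</preformat>
<p>The resulting xml_text holds two structure elements and one relation element that refers to them by identifier, an arrangement that a fixed list of template slots cannot express directly.</p>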
</statement>
</p>
</sec>
<sec sec-type="subsection" id="sec3.4">
<title>3.4. Integrated IE Systems</title>
<p>The supervised learning IE system of Tang and Heidorn [
<xref ref-type="bibr" rid="B13">13</xref>
], MultiFlora, and the CharaParser system, all reviewed above, can be described using the reference model depicted in
<xref ref-type="fig" rid="fig2">Figure 2</xref>
. Here, we describe another system that integrates formal ontologies. This is the text mining system that is currently under development by the Phenoscape project (
<ext-link ext-link-type="uri" xlink:href="http://www.phenoscape.org/">http://www.phenoscape.org/</ext-link>
). The goal of Phenoscape is to turn textual phenotype descriptions into EQ expressions [
<xref ref-type="bibr" rid="B69">80</xref>
] to support machine reasoning over scientific knowledge as a transformative way of conducting biological research. In this application, EQ expressions may be considered both the IE template and a data standard. The input to the Phenoscape text mining system is digital or OCRed phylogenetic publications. The character descriptions are targeted (1 character description = 1 character statement + multiple character state statements) and used to form the taxon-character matrix. The target sections are extracted by student assistants using Phenex and put into NeXML (
<ext-link ext-link-type="uri" xlink:href="http://www.nexml.org/">http://www.nexml.org/</ext-link>
) format. NeXML is an exchange standard for representing phyloinformatic data. It is inspired by the commonly used NEXUS format but is more robust and easier to process. There is one NeXML file for each source text. NeXML files are the input to CharaParser, which performs bootstrapping-based learning (i.e., unsupervised learning) and deep parsing to extract information and output candidate EQ expressions. CharaParser learns lexicons of anatomy terms and character terms from description collections. Learned terms are reviewed by biologist curators (many OCR errors are detected during this step). Terms that are not in existing anatomy ontologies are proposed to the ontologies for addition. The lexicons and ontologies are the knowledge entities that the text mining system iteratively uses and enhances. With new terms added to the ontologies, the system replaces the terms in candidate EQ statements with term IDs from the ontologies. For example, [E]tooth [Q]large is turned into [E]TAO: 0001625 [Q]PATO: 0001202. The candidate EQ expressions are reviewed and accepted by biologist curators using Phenex. Final EQ expressions are loaded into the Phenoscape Knowledge base at
<ext-link ext-link-type="uri" xlink:href="http://kb.phenoscape.org/">http://kb.phenoscape.org/</ext-link>
. This EQ-populated knowledge base supports formal logical reasoning. At the time of writing, development work is ongoing to integrate CharaParser with Phenex and produce an integrated text-mining system for Phenoscape. It is important to note that the applicability of Phenex and CharaParser is not taxon specific.</p>
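<p>The term replacement step can be sketched in Python as a lookup against term-to-ID tables. This is a minimal illustration, not the Phenoscape implementation: the dictionaries, the pattern, and the function name resolve_eq are assumptions, and the two identifiers are those quoted in the example above, written in compact form.</p>
<preformat>
```python
import re

# Term-to-ID lookup tables. The two IDs are the ones quoted in the text;
# a real system would query the anatomy and quality ontologies instead.
ENTITY_IDS = {"tooth": "TAO:0001625"}
QUALITY_IDS = {"large": "PATO:0001202"}

# Matches statements of the form "[E]entity term [Q]quality term".
EQ_PATTERN = re.compile(r"\[E\](.+?)\s*\[Q\](.+)")


def resolve_eq(candidate):
    """Replace the terms in a candidate EQ statement with ontology IDs.

    Returns None when a term has no ID yet, so that the statement can be
    routed to curators and the term proposed to the ontology.
    """
    match = EQ_PATTERN.fullmatch(candidate.strip())
    if match is None:
        raise ValueError("not an EQ statement: " + repr(candidate))
    entity_id = ENTITY_IDS.get(match.group(1).strip())
    quality_id = QUALITY_IDS.get(match.group(2).strip())
    if entity_id is None or quality_id is None:
        return None
    return "[E]" + entity_id + " [Q]" + quality_id
```
</preformat>
<p>In this sketch a statement whose entity term is missing from the table, such as one about a structure not yet in the anatomy ontology, is returned as None rather than guessed, which mirrors the curator review described above.</p>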
</sec>
</sec>
<sec id="sec4">
<title>4. Conclusion</title>
<p> NLP approaches are capable of extracting large amounts of information from free text. However, biology text presents a unique challenge (compared to news articles) to machine-learning algorithms due to its ambiguity, diversity, and specialized language. Successful IE strategies in biodiversity science take advantage of the Linnaean binomial structure of names and the structured nature of taxon descriptions. Multiple tools currently exist for fuzzy matching of terms, automated annotation, named-entity recognition, and morphological character extraction that use a variety of approaches. None have yet been used on a large scale to extract information about all life, but several, such as CharaParser, show potential to be used in this way. Further improvement of biodiversity IE tools could be achieved through increased participation in the annual BioCreative competitions (
<ext-link ext-link-type="uri" xlink:href="http://www.biocreative.org/">http://www.biocreative.org/</ext-link>
) and assessing tool performance on publicly available document sets so that comparison between systems (and thus identification of methods that have real potential to address biodiversity IE problems) becomes easier.</p>
<p>A long-term vision for making biodiversity data machine readable is the compilation of semantic species descriptions that can be linked into a semantic web for biology. An example of semantic species information can be found at TaxonConcept.org. This concept raises many questions concerning semantics that are outside the scope of this paper, such as what makes a “good” semantic description of a species. Many of these issues are technical and are being addressed within the computer science community. Two data pathways need to be developed to achieve the semantic web for biology. One is a path going forward, in which new data are made machine-readable from the beginning of a research project. The model of mobilizing data many years after collection, with little to no data management planning during collection, is neither sustainable nor desirable. Research is being applied to this area, and publishers such as Pensoft are working to capture machine-readable data about species at the point of publication. The other is a path for mobilizing data that have already been collected. NLP holds much promise in helping with the second path.</p>
<p>Mobilizing the entirety of biodiversity knowledge collected over the past 250 years is an ambitious goal that requires meeting several challenges from both the taxonomic and technological fronts. Considering the constantly changing nature of biodiversity science and the constraints of NLP algorithms, best results may be achieved by drawing information from high quality modern reviews of taxonomic groups rather than repositories of original descriptions. However, such works can be rare or nonexistent for some taxa. Thus, issues such as proper aggregation of information extracted from multiple sources on a single subject (as mentioned above) still need to be addressed. In addition, demanding that a modern review be available somewhat defeats the purpose of applying NLP to biodiversity science. While using a modern review may be ideal when available, it should not be required for information extraction.</p>
<p>Biodiversity science, as a discipline, is being asked to address numerous challenges related to climate change, biodiversity loss, and invasive species. Solutions to these problems require discovery and aggregation of data from the entire pool of biological knowledge, including what is contained exclusively in print holdings. Digitization and IE on this scale are unprecedented. Unsupervised algorithms hold the greatest promise for achieving the scalability required because they do not require manually generated training data. However, most successful IE algorithms use combinations of supervised and unsupervised strategies and multiple NLP approaches because not all problems can be solved with an unsupervised algorithm. If the challenge is not met, irreplaceable data from centuries of research funded by billions of dollars may be lost. The annotation and extraction algorithms mentioned in this manuscript are key steps toward liberating existing biological data and serve as preliminary evidence that this goal can be achieved.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p> The authors would like to thank Dr. David J. Patterson, Dr. Holly Bowers, and Mr. Nathan Wilson for thoughtful comments on an early version of this manuscript and productive discussion. This work was funded in part by the MacArthur Foundation Grant to the Encyclopedia of Life, the National Science Foundation Data Net Program Grant no. 0830976, and the National Science Foundation Emerging Front Grant no. 0849982.</p>
</ack>
<ref-list>
<ref id="B1">
<label>1</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wuethrich</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>How climate change alters rhythms of the wild</article-title>
<source>
<italic>Science</italic>
</source>
<year>2000</year>
<volume>287</volume>
<issue>5454</issue>
<fpage>793</fpage>
<lpage>795</lpage>
<pub-id pub-id-type="pmid">10691549</pub-id>
</element-citation>
</ref>
<ref id="B2">
<label>2</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bradshaw</surname>
<given-names>WE</given-names>
</name>
<name>
<surname>Holzapfel</surname>
<given-names>CM</given-names>
</name>
</person-group>
<article-title>Genetic shift in photoperiodic response correlated with global warming</article-title>
<source>
<italic>Proceedings of the National Academy of Sciences of the United States of America</italic>
</source>
<year>2001</year>
<volume>98</volume>
<issue>25</issue>
<fpage>14509</fpage>
<lpage>14511</lpage>
<pub-id pub-id-type="pmid">11698659</pub-id>
</element-citation>
</ref>
<ref id="B3">
<label>3</label>
<element-citation publication-type="journal">
<collab>National Academy of Sciences</collab>
<article-title>New biology for the 21st Century</article-title>
<source>
<italic>Frontiers in Ecology and the Environment</italic>
</source>
<year>2009</year>
<volume>7</volume>
<issue>9, article 455</issue>
</element-citation>
</ref>
<ref id="B4">
<label>4</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thessen</surname>
<given-names>AE</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Data issues in life science</article-title>
<source>
<italic>ZooKeys</italic>
</source>
<year>2011</year>
<volume>150</volume>
<fpage>15</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="pmid">22207805</pub-id>
</element-citation>
</ref>
<ref id="B5">
<label>5</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Hey</surname>
<given-names>A</given-names>
</name>
</person-group>
<source>
<italic>The Fourth Paradigm: Data-Intensive Scientific Discovery</italic>
</source>
<year>2009</year>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://iw.fh-potsdam.de/fileadmin/FB5/Dokumente/forschung/tagungen/i-science/TonyHey_-__eScience_Potsdam__Mar2010____complete_.pdf">http://iw.fh-potsdam.de/fileadmin/FB5/Dokumente/forschung/tagungen/i-science/TonyHey_-__eScience_Potsdam__Mar2010____complete_.pdf</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B6">
<label>6</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stein</surname>
<given-names>LD</given-names>
</name>
</person-group>
<article-title>Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges</article-title>
<source>
<italic>Nature Reviews Genetics</italic>
</source>
<year>2008</year>
<volume>9</volume>
<fpage>678</fpage>
<lpage>688</lpage>
</element-citation>
</ref>
<ref id="B7">
<label>7</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heidorn</surname>
<given-names>PB</given-names>
</name>
</person-group>
<article-title>Shedding light on the dark data in the long tail of science</article-title>
<source>
<italic>Library Trends</italic>
</source>
<year>2008</year>
<volume>57</volume>
<issue>2</issue>
<fpage>280</fpage>
<lpage>299</lpage>
</element-citation>
</ref>
<ref id="B8">
<label>8</label>
<element-citation publication-type="journal">
<collab>Key Perspectives Ltd</collab>
<article-title>Data dimensions: disciplinary differences in research data sharing, reuse and long term viability</article-title>
<comment>Digital Curation Centre, 2010,
<ext-link ext-link-type="uri" xlink:href="http://scholar.google.com/scholar?hl=en&q=Data+Dimensions:+disciplinary+differences+in+research+data-sharing,+reuse+and+long+term+viability.++&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0">http://scholar.google.com/scholar?hl=en&q=Data+Dimensions:+disciplinary+differences+in+research+data-sharing,+reuse+and+long+term+viability.++&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B9">
<label>9</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vollmar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Macklin</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Ford</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Natural history specimen digitization: challenges and concerns</article-title>
<source>
<italic>Biodiversity Informatics</italic>
</source>
<year>2010</year>
<volume>7</volume>
<issue>2</issue>
</element-citation>
</ref>
<ref id="B10">
<label>10</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schofield</surname>
<given-names>PN</given-names>
</name>
<name>
<surname>Eppig</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Huala</surname>
<given-names>E</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Sustaining the data and bioresource commons</article-title>
<source>
<italic>Science</italic>
</source>
<year>2010</year>
<volume>330</volume>
<issue>6004</issue>
<fpage>592</fpage>
<lpage>593</lpage>
</element-citation>
</ref>
<ref id="B11">
<label>11</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Groth</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gibson</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Velterop</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Anatomy of a Nanopublication</article-title>
<source>
<italic>Information Services & Use</italic>
</source>
<year>2010</year>
<volume>30</volume>
<issue>1-2</issue>
<fpage>51</fpage>
<lpage>56</lpage>
</element-citation>
</ref>
<ref id="B12">
<label>12</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kalfatovic</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Building a global library of taxonomic literature</article-title>
<conf-name>In: 28th Congresso Brasileiro de Zoologia Biodiversidade e Sustentabilidade</conf-name>
<conf-date>2010</conf-date>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.slideshare.net/Kalfatovic/building-a-global-library-of-taxonomic-literature">http://www.slideshare.net/Kalfatovic/building-a-global-library-of-taxonomic-literature</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B13">
<label>13</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Heidorn</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Using automatically extracted information in species page retrieval</article-title>
<year>2007</year>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://scholar.google.com/scholar?hl=en&q=Tang+Heidorn+2007+using+automatically+extracted&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0">http://scholar.google.com/scholar?hl=en&q=Tang+Heidorn+2007+using+automatically+extracted&btnG=Search&as_sdt=0,22&as_ylo=&as_vis=0#0</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B14">
<label>14</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Selden</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Boufford</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Semantic annotation of biosystematics literature without training examples</article-title>
<source>
<italic>Journal of the American Society for Information Science and Technology</italic>
</source>
<year>2010</year>
<volume>61</volume>
<fpage>522</fpage>
<lpage>542</lpage>
</element-citation>
</ref>
<ref id="B15">
<label>15</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Taylor</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Extracting knowledge from biological descriptions</article-title>
<conf-name>In: Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases</conf-name>
<conf-date>1995</conf-date>
<fpage>114</fpage>
<lpage>119</lpage>
</element-citation>
</ref>
<ref id="B28">
<label>16</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Competency evaluation of plant character ontologies against domain literature</article-title>
<source>
<italic>Journal of the American Society for Information Science and Technology</italic>
</source>
<year>2010</year>
<volume>61</volume>
<issue>6</issue>
<fpage>1144</fpage>
<lpage>1165</lpage>
</element-citation>
</ref>
<ref id="B70">
<label>17</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miyao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Sagae</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Sætre</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Matsuzaki</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Tsujii</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Evaluating contributions of natural language parsers to protein-protein interaction extraction</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2009</year>
<volume>25</volume>
<issue>3</issue>
<fpage>394</fpage>
<lpage>400</lpage>
<pub-id pub-id-type="pmid">19073593</pub-id>
</element-citation>
</ref>
<ref id="B71">
<label>18</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Humphreys</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Demetriou</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gaizauskas</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures</article-title>
<conf-name>In: Proceedings of the Pacific Symposium on Biocomputing (PSB '00), vol. 513</conf-name>
<conf-date>2000</conf-date>
<fpage>505</fpage>
<lpage>513</lpage>
</element-citation>
</ref>
<ref id="B72">
<label>19</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gaizauskas</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Demetriou</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Artymiuk</surname>
<given-names>PJ</given-names>
</name>
<name>
<surname>Willett</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Protein structures and information extraction from biological texts: the PASTA system</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2003</year>
<volume>19</volume>
<issue>1</issue>
<fpage>135</fpage>
<lpage>143</lpage>
<pub-id pub-id-type="pmid">12499303</pub-id>
</element-citation>
</ref>
<ref id="B73">
<label>20</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Divoli</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Attwood</surname>
<given-names>TK</given-names>
</name>
</person-group>
<article-title>BioIE: extracting informative sentences from the biomedical literature</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2005</year>
<volume>21</volume>
<issue>9</issue>
<fpage>2138</fpage>
<lpage>2139</lpage>
<pub-id pub-id-type="pmid">15691860</pub-id>
</element-citation>
</ref>
<ref id="B74">
<label>21</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Corney</surname>
<given-names>DPA</given-names>
</name>
<name>
<surname>Buxton</surname>
<given-names>BF</given-names>
</name>
<name>
<surname>Langdon</surname>
<given-names>WB</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>DT</given-names>
</name>
</person-group>
<article-title>BioRAT: extracting biological information from full-length papers</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2004</year>
<volume>20</volume>
<issue>17</issue>
<fpage>3206</fpage>
<lpage>3213</lpage>
<pub-id pub-id-type="pmid">15231534</pub-id>
</element-citation>
</ref>
<ref id="B75">
<label>22</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Sharp</surname>
<given-names>BM</given-names>
</name>
</person-group>
<article-title>Content-rich biological network constructed by mining PubMed abstracts</article-title>
<source>
<italic>BMC Bioinformatics</italic>
</source>
<year>2004</year>
<volume>5, article 147</volume>
</element-citation>
</ref>
<ref id="B79">
<label>23</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>X</given-names>
</name>
</person-group>
<article-title>Dragon toolkit: incorporating auto-learned semantic knowledge into large-scale text retrieval and mining</article-title>
<conf-name>In: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '07)</conf-name>
<conf-date>October 2007</conf-date>
<fpage>197</fpage>
<lpage>201</lpage>
</element-citation>
</ref>
<ref id="B76">
<label>24</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rebholz-Schuhmann</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kirsch</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Arregui</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gaudan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Riethoven</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stoehr</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>EBIMed—text crunching to gather facts for proteins from Medline</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2007</year>
<volume>23</volume>
<issue>2</issue>
<fpage>e237</fpage>
<lpage>e244</lpage>
<pub-id pub-id-type="pmid">17237098</pub-id>
</element-citation>
</ref>
<ref id="B77">
<label>25</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>ZZ</given-names>
</name>
<name>
<surname>Mani</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Hermoso</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>CH</given-names>
</name>
</person-group>
<article-title>iProLINK: an integrated protein resource for literature mining</article-title>
<source>
<italic>Computational Biology and Chemistry</italic>
</source>
<year>2004</year>
<volume>28</volume>
<issue>5-6</issue>
<fpage>409</fpage>
<lpage>416</lpage>
<pub-id pub-id-type="pmid">15556482</pub-id>
</element-citation>
</ref>
<ref id="B78">
<label>26</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Demaine</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>L</given-names>
</name>
<name>
<surname>De Bruijn</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>LitMiner: integration of library services within a bio-informatics application</article-title>
<source>
<italic>Biomedical Digital Libraries</italic>
</source>
<year>2006</year>
<volume>3, article 11</volume>
</element-citation>
</ref>
<ref id="B16">
<label>27</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lease</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Charniak</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Parsing biomedical literature</article-title>
<conf-name>In: Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP '05)</conf-name>
<conf-date>2005</conf-date>
<conf-loc>Jeju Island, Korea</conf-loc>
</element-citation>
</ref>
<ref id="B17">
<label>28</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pyysalo</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Salakoski</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Lexical adaptation of link grammar to the biomedical sublanguage: a comparative evaluation of three approaches</article-title>
<source>
<italic>BMC Bioinformatics</italic>
</source>
<year>2006</year>
<volume>7</volume>
<issue>supplement 3, article S2</issue>
</element-citation>
</ref>
<ref id="B18">
<label>29</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rimell</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Clark</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Porting a lexicalized-grammar parser to the biomedical domain</article-title>
<source>
<italic>Journal of Biomedical Informatics</italic>
</source>
<year>2009</year>
<volume>42</volume>
<issue>5</issue>
<fpage>852</fpage>
<lpage>865</lpage>
<pub-id pub-id-type="pmid">19141332</pub-id>
</element-citation>
</ref>
<ref id="B80">
<label>30</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Converting taxonomic descriptions to new digital formats</article-title>
<source>
<italic>Biodiversity Informatics</italic>
</source>
<year>2008</year>
<volume>5</volume>
<fpage>20</fpage>
<lpage>40</lpage>
</element-citation>
</ref>
<ref id="B44">
<label>31</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koning</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sarkar</surname>
<given-names>IN</given-names>
</name>
<name>
<surname>Moritz</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>TaxonGrab: extracting taxonomic names from text</article-title>
<source>
<italic>Biodiversity Informatics</italic>
</source>
<year>2005</year>
<volume>2</volume>
<fpage>79</fpage>
<lpage>82</lpage>
</element-citation>
</ref>
<ref id="B27">
<label>32</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Akella</surname>
<given-names>LM</given-names>
</name>
<name>
<surname>Norton</surname>
<given-names>CN</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>NetiNeti: discovery of scientific names from text using machine learning methods</article-title>
<comment>2011</comment>
</element-citation>
</ref>
<ref id="B47">
<label>33</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gerner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Nenadic</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>CM</given-names>
</name>
</person-group>
<article-title>LINNAEUS: a species name identification system for biomedical literature</article-title>
<source>
<italic>BMC Bioinformatics</italic>
</source>
<year>2010</year>
<volume>11, article 85</volume>
</element-citation>
</ref>
<ref id="B49">
<label>34</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Naderi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Kappler</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2011</year>
<volume>27</volume>
<issue>19</issue>
<fpage>2721</fpage>
<lpage>2729</lpage>
<pub-id pub-id-type="pmid">21828087</pub-id>
</element-citation>
</ref>
<ref id="B61">
<label>35</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Abascal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sánchez</surname>
<given-names>JA</given-names>
</name>
</person-group>
<article-title>X-tract: structure extraction from botanical textual descriptions</article-title>
<conf-name>In: Proceedings of the String Processing &amp; Information Retrieval Symposium &amp; International Workshop on Groupware</conf-name>
<conf-date>September 1999</conf-date>
<conf-loc>Cancun, Mexico</conf-loc>
<publisher-name>IEEE Computer Society</publisher-name>
<fpage>2</fpage>
<lpage>7</lpage>
</element-citation>
</ref>
<ref id="B68">
<label>36</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>CharaParser for fine-grained semantic annotation of organism morphological descriptions</article-title>
<source>
<italic>Journal of the American Society for Information Science and Technology</italic>
</source>
<year>2012</year>
<volume>63</volume>
<issue>4</issue>
<fpage>738</fpage>
<lpage>754</lpage>
</element-citation>
</ref>
<ref id="B19">
<label>37</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krauthammer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rzhetsky</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Morozov</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Using BLAST for identifying gene and protein names in journal articles</article-title>
<source>
<italic>Gene</italic>
</source>
<year>2000</year>
<volume>259</volume>
<issue>1-2</issue>
<fpage>245</fpage>
<lpage>252</lpage>
<pub-id pub-id-type="pmid">11163982</pub-id>
</element-citation>
</ref>
<ref id="B20">
<label>38</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lenzi</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Frabetti</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Facchin</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<article-title>UniGene tabulator: a full parser for the UniGene format</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2006</year>
<volume>22</volume>
<issue>20</issue>
<fpage>2570</fpage>
<lpage>2571</lpage>
<pub-id pub-id-type="pmid">16895929</pub-id>
</element-citation>
</ref>
<ref id="B21">
<label>39</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Nasr</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rambow</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>Supertagging and full parsing</article-title>
<conf-name>In: Proceedings of the 7th International Workshop on Tree Adjoining Grammar and Related Formalisms (TAG '04)</conf-name>
<conf-date>2004</conf-date>
</element-citation>
</ref>
<ref id="B22">
<label>40</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Leaman</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Gonzalez</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>BANNER: an executable survey of advances in biomedical named entity recognition</article-title>
<conf-name>In: Proceedings of the Pacific Symposium on Biocomputing (PSB '08)</conf-name>
<conf-date>January 2008</conf-date>
<conf-loc>Kona, Hawaii, USA</conf-loc>
<fpage>652</fpage>
<lpage>663</lpage>
</element-citation>
</ref>
<ref id="B23">
<label>41</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Schröder</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Knowledge-based processing of medical language: a language engineering approach</article-title>
<conf-name>In: Proceedings of the 16th German Conference on Artificial Intelligence (GWAI '92), vol. 671</conf-name>
<conf-date>August-September 1992</conf-date>
<conf-loc>Bonn, Germany</conf-loc>
<fpage>221</fpage>
<lpage>234</lpage>
</element-citation>
</ref>
<ref id="B24">
<label>42</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>E</given-names>
</name>
</person-group>
<source>
<italic>Data Mining: Practical Machine Learning Tools and Techniques</italic>
</source>
<year>2005</year>
<edition>2nd edition</edition>
<publisher-name>Morgan Kaufmann</publisher-name>
<series>Morgan Kaufmann Series in Data Management Systems</series>
</element-citation>
</ref>
<ref id="B25">
<label>43</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blaschke</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hirschman</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Valencia</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Information extraction in molecular biology</article-title>
<source>
<italic>Briefings in Bioinformatics</italic>
</source>
<year>2002</year>
<volume>3</volume>
<issue>2</issue>
<fpage>154</fpage>
<lpage>165</lpage>
<pub-id pub-id-type="pmid">12139435</pub-id>
</element-citation>
</ref>
<ref id="B26">
<label>44</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Jimeno-Yepes</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Aronson</surname>
<given-names>AR</given-names>
</name>
</person-group>
<article-title>Self-training and co-training in biomedical word sense disambiguation</article-title>
<conf-name>In: Proceedings of the Workshop on Biomedical Natural Language Processing (ACL-HLT '11)</conf-name>
<conf-date>2011</conf-date>
<conf-loc>Portland, Oregon</conf-loc>
<fpage>182</fpage>
<lpage>183</lpage>
</element-citation>
</ref>
<ref id="B29">
<label>45</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Freeland</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>An evaluation of taxonomic name finding &amp; next steps in Biodiversity Heritage Library (BHL) developments</article-title>
<source>
<italic>Nature Precedings</italic>
</source>
<year>2009</year>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://precedings.nature.com/documents/3372/version/1">http://precedings.nature.com/documents/3372/version/1</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B30">
<label>46</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kornai</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Experimental HMM-based postal OCR system</article-title>
<conf-name>In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), vol. 4</conf-name>
<conf-date>April 1997</conf-date>
<fpage>3177</fpage>
<lpage>3180</lpage>
</element-citation>
</ref>
<ref id="B31">
<label>47</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kornai</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mohiuddin</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Connell</surname>
<given-names>SD</given-names>
</name>
</person-group>
<article-title>Recognition of cursive writing on personal checks</article-title>
<conf-name>In: Proceedings of the 5th International Workshop on Frontiers in Handwriting Recognition</conf-name>
<conf-date>1996</conf-date>
<conf-loc>Essex, UK</conf-loc>
<publisher-name>Citeseer</publisher-name>
<fpage>373</fpage>
<lpage>378</lpage>
</element-citation>
</ref>
<ref id="B32">
<label>48</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Freeland</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing</article-title>
<source>
<italic>BioSystematics Berlin</italic>
</source>
<year>2011</year>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.slideshare.net/chrisfreeland/digitization-and-enhancement-of-biodiversity-literature-through-ocr-scientific-names-mapping-and-crowdsourcing">http://www.slideshare.net/chrisfreeland/digitization-and-enhancement-of-biodiversity-literature-through-ocr-scientific-names-mapping-and-crowdsourcing</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B33">
<label>49</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Willis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>King</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Morse</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Dil</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lyal</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Roberts</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>From XML to XML: the why and how of making the biodiversity literature accessible to researchers</article-title>
<conf-name>In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10)</conf-name>
<conf-date>May 2010</conf-date>
<conf-loc>Valletta, Malta</conf-loc>
<publisher-name>European Language Resources Association (ELRA)</publisher-name>
<fpage>1237</fpage>
<lpage>1244</lpage>
</element-citation>
</ref>
<ref id="B34">
<label>50</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Bapst</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Ingold</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Using typography in document image analysis</article-title>
<conf-name>In: Proceedings of Raster Imaging and Digital Typography (RIDT '98)</conf-name>
<conf-date>March-April 1998</conf-date>
<conf-loc>Saint-Malo, France</conf-loc>
<fpage>240</fpage>
<lpage>251</lpage>
</element-citation>
</ref>
<ref id="B35">
<label>51</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Weitzman</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Lyal</surname>
<given-names>CHC</given-names>
</name>
</person-group>
<source>
<italic>An XML Schema for Taxonomic Literature—TaXMLit</italic>
</source>
<year>2004</year>
<comment>
<ext-link ext-link-type="uri" xlink:href="http://www.sil.si.edu/digitalcollections/bca/documentation/taXMLitv1-3Intro.pdf">http://www.sil.si.edu/digitalcollections/bca/documentation/taXMLitv1-3Intro.pdf</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B36">
<label>52</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Rees</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases</article-title>
<conf-name>In: Proceedings of TDWG</conf-name>
<conf-date>2008</conf-date>
<comment>pp. 35,
<ext-link ext-link-type="uri" xlink:href="http://www.tdwg.org/fileadmin/2008conference/documents/Proceedings2008.pdf#page=35">http://www.tdwg.org/fileadmin/2008conference/documents/Proceedings2008.pdf#page=35</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B37">
<label>53</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sautter</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Böhm</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Agosti</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Semi-automated XML markup of biosystematic legacy literature with the GoldenGATE editor</article-title>
<conf-name>In: Proceedings of the Pacific Symposium on Biocomputing (PSB '07)</conf-name>
<conf-date>2007</conf-date>
<publisher-name>World Scientific</publisher-name>
<fpage>391</fpage>
<lpage>402</lpage>
</element-citation>
</ref>
<ref id="B38">
<label>54</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Settles</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2005</year>
<volume>21</volume>
<issue>14</issue>
<fpage>3191</fpage>
<lpage>3192</lpage>
<pub-id pub-id-type="pmid">15860559</pub-id>
</element-citation>
</ref>
<ref id="B39">
<label>55</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pavlopoulos</surname>
<given-names>GA</given-names>
</name>
<name>
<surname>Pafilis</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Kuhn</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hooper</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>OnTheFly: a tool for automated document-based text annotation, data linking and network generation</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2009</year>
<volume>25</volume>
<issue>7</issue>
<fpage>977</fpage>
<lpage>978</lpage>
<pub-id pub-id-type="pmid">19223449</pub-id>
</element-citation>
</ref>
<ref id="B40">
<label>56</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pafilis</surname>
<given-names>E</given-names>
</name>
<name>
<surname>O’Donoghue</surname>
<given-names>SI</given-names>
</name>
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Reflect: augmented browsing for the life scientist</article-title>
<source>
<italic>Nature Biotechnology</italic>
</source>
<year>2009</year>
<volume>27</volume>
<issue>6</issue>
<fpage>508</fpage>
<lpage>510</lpage>
</element-citation>
</ref>
<ref id="B41">
<label>57</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuhn</surname>
<given-names>M</given-names>
</name>
<name>
<surname>von Mering</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Campillos</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>STITCH: interaction networks of chemicals and proteins</article-title>
<source>
<italic>Nucleic Acids Research</italic>
</source>
<year>2008</year>
<volume>36</volume>
<issue>1</issue>
<fpage>D684</fpage>
<lpage>D688</lpage>
<pub-id pub-id-type="pmid">18084021</pub-id>
</element-citation>
</ref>
<ref id="B42">
<label>58</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Balhoff</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Dahdul</surname>
<given-names>WM</given-names>
</name>
<name>
<surname>Kothari</surname>
<given-names>CR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Phenex: ontological annotation of phenotypic diversity</article-title>
<source>
<italic>PLoS ONE</italic>
</source>
<year>2010</year>
<volume>5</volume>
<issue>5, article e10500</issue>
</element-citation>
</ref>
<ref id="B43">
<label>59</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dahdul</surname>
<given-names>WM</given-names>
</name>
<name>
<surname>Balhoff</surname>
<given-names>JP</given-names>
</name>
<name>
<surname>Engeman</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature</article-title>
<source>
<italic>PLoS ONE</italic>
</source>
<year>2010</year>
<volume>5</volume>
<issue>5</issue>
<comment>Article ID e10708.</comment>
</element-citation>
</ref>
<ref id="B45">
<label>60</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sautter</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Böhm</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Agosti</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>A combining approach to find all taxon names (FAT) in legacy biosystematics literature</article-title>
<source>
<italic>Biodiversity Informatics</italic>
</source>
<year>2007</year>
<volume>3</volume>
<fpage>46</fpage>
<lpage>58</lpage>
</element-citation>
</ref>
<ref id="B46">
<label>61</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leary</surname>
<given-names>PR</given-names>
</name>
<name>
<surname>Remsen</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Norton</surname>
<given-names>CN</given-names>
</name>
<name>
<surname>Patterson</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Sarkar</surname>
<given-names>IN</given-names>
</name>
</person-group>
<article-title>UbioRSS: tracking taxonomic literature using RSS</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2007</year>
<volume>23</volume>
<issue>11</issue>
<fpage>1434</fpage>
<lpage>1436</lpage>
<pub-id pub-id-type="pmid">17392332</pub-id>
</element-citation>
</ref>
<ref id="B48">
<label>62</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Okazaki</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Ananiadou</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Building an abbreviation dictionary using a term recognition approach</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2006</year>
<volume>22</volume>
<issue>24</issue>
<fpage>3089</fpage>
<lpage>3095</lpage>
<pub-id pub-id-type="pmid">17050571</pub-id>
</element-citation>
</ref>
<ref id="B50">
<label>63</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bontcheva</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Tablan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Maynard</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Evolving GATE to meet new challenges in language engineering</article-title>
<source>
<italic>Natural Language Engineering</italic>
</source>
<year>2004</year>
<volume>10</volume>
<issue>3-4</issue>
<fpage>349</fpage>
<lpage>373</lpage>
</element-citation>
</ref>
<ref id="B51">
<label>64</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Cunningham</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Maynard</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bontcheva</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Tablan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Ursu</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<source>
<italic>Developing Language Processing Components with GATE (A User Guide)</italic>
</source>
<year>2006</year>
<publisher-name>University of Sheffield</publisher-name>
</element-citation>
</ref>
<ref id="B52">
<label>65</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Fitzpatrick</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Bachenko</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hindle</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>The status of telegraphic sublanguages</article-title>
<source>
<italic>Analyzing Language in Restricted Domains: Sublanguage Description and Processing</italic>
</source>
<year>1986</year>
<fpage>39</fpage>
<lpage>51</lpage>
</element-citation>
</ref>
<ref id="B53">
<label>66</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lydon</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Tablan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Maynard</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Populating a database from parallel texts using ontology-based information extraction</article-title>
<source>
<italic>Natural Language Processing and Information Systems</italic>
</source>
<year>2004</year>
<volume>3136</volume>
<fpage>357</fpage>
<lpage>365</lpage>
</element-citation>
</ref>
<ref id="B54">
<label>67</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Gene name ambiguity of eukaryotic nomenclatures</article-title>
<source>
<italic>Bioinformatics</italic>
</source>
<year>2005</year>
<volume>21</volume>
<issue>2</issue>
<fpage>248</fpage>
<lpage>256</lpage>
<pub-id pub-id-type="pmid">15333458</pub-id>
</element-citation>
</ref>
<ref id="B55">
<label>68</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hatzivassiloglou</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Wilbur</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles</article-title>
<source>
<italic>Journal of Biomedical Informatics</italic>
</source>
<year>2007</year>
<volume>40</volume>
<issue>2</issue>
<fpage>150</fpage>
<lpage>159</lpage>
<pub-id pub-id-type="pmid">16843731</pub-id>
</element-citation>
</ref>
<ref id="B56">
<label>69</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Chang</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Schütze</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Abbreviations in biomedical text</article-title>
<source>
<italic>Text Mining for Biology and Biomedicine</italic>
</source>
<year>2006</year>
<fpage>99</fpage>
<lpage>119</lpage>
</element-citation>
</ref>
<ref id="B57">
<label>70</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wren</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Garner</surname>
<given-names>HR</given-names>
</name>
</person-group>
<article-title>Heuristics for identification of acronym-definition patterns within text: towards an automated construction of comprehensive acronym-definition dictionaries</article-title>
<source>
<italic>Methods of Information in Medicine</italic>
</source>
<year>2002</year>
<volume>41</volume>
<issue>5</issue>
<fpage>426</fpage>
<lpage>434</lpage>
<pub-id pub-id-type="pmid">12501816</pub-id>
</element-citation>
</ref>
<ref id="B58">
<label>71</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lydon</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Wood</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Data patterns in multiple botanical descriptions: implications for automatic processing of legacy data</article-title>
<source>
<italic>Systematics and Biodiversity</italic>
</source>
<year>2003</year>
<volume>1</volume>
<issue>2</issue>
<fpage>151</fpage>
<lpage>157</lpage>
</element-citation>
</ref>
<ref id="B59">
<label>72</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Taylor</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Using Prolog for biological descriptions</article-title>
<conf-name>In: Proceedings of the 3rd International Conference on the Practical Application of Prolog</conf-name>
<conf-date>1995</conf-date>
<fpage>587</fpage>
<lpage>597</lpage>
</element-citation>
</ref>
<ref id="B60">
<label>73</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Radford</surname>
<given-names>AE</given-names>
</name>
</person-group>
<source>
<italic>Fundamentals of Plant Systematics</italic>
</source>
<year>1986</year>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>Harper &amp; Row</publisher-name>
</element-citation>
</ref>
<ref id="B62">
<label>74</label>
<element-citation publication-type="other">
<person-group person-group-type="author">
<name>
<surname>Diederich</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fortuner</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Milton</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Computer-assisted data extraction from the taxonomical literature</article-title>
<comment>1999,
<ext-link ext-link-type="uri" xlink:href="http://math.ucdavis.edu/~milton/genisys.html">http://math.ucdavis.edu/~milton/genisys.html</ext-link>
</comment>
</element-citation>
</ref>
<ref id="B63">
<label>75</label>
<element-citation publication-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lydon</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Tablan</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Maynard</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Using parallel texts to improve recall in IE</article-title>
<conf-name>In: Proceedings of Recent Advances in Natural Language Processing (RANLP '03)</conf-name>
<conf-date>2003</conf-date>
<conf-loc>Borovetz, Bulgaria</conf-loc>
<fpage>505</fpage>
<lpage>512</lpage>
</element-citation>
</ref>
<ref id="B64">
<label>76</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Heidorn</surname>
<given-names>PB</given-names>
</name>
</person-group>
<article-title>The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions</article-title>
<source>
<italic>Journal of the American Society for Information Science and Technology</italic>
</source>
<year>2007</year>
<volume>58</volume>
<issue>1</issue>
<fpage>133</fpage>
<lpage>149</lpage>
</element-citation>
</ref>
<ref id="B65">
<label>77</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Wei</surname>
<given-names>Q</given-names>
</name>
</person-group>
<source>
<italic>Information fusion in taxonomic descriptions</italic>
</source>
<year>2011</year>
<publisher-loc>Champaign, Ill, USA</publisher-loc>
<publisher-name>University of Illinois at Urbana-Champaign</publisher-name>
<comment>Ph.D. thesis</comment>
</element-citation>
</ref>
<ref id="B66">
<label>78</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soderland</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Learning information extraction rules for semi-structured and free text</article-title>
<source>
<italic>Machine Learning</italic>
</source>
<year>1999</year>
<volume>34</volume>
<issue>1</issue>
<fpage>233</fpage>
<lpage>272</lpage>
</element-citation>
</ref>
<ref id="B67">
<label>79</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Singaram</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Janning</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Combine unsupervised learning and heuristic rules to annotate morphological characters</article-title>
<source>
<italic>Proceedings of the American Society for Information Science and Technology</italic>
</source>
<year>2011</year>
<volume>48</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>9</lpage>
</element-citation>
</ref>
<ref id="B69">
<label>80</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mabee</surname>
<given-names>PM</given-names>
</name>
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Cronk</surname>
<given-names>Q</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Phenotype ontologies: the bridge between genomics and evolution</article-title>
<source>
<italic>Trends in Ecology and Evolution</italic>
</source>
<year>2007</year>
<volume>22</volume>
<issue>7</issue>
<fpage>345</fpage>
<lpage>350</lpage>
<pub-id pub-id-type="pmid">17416439</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
<floats-group>
<fig id="fig1" position="float">
<label>Figure 1</label>
<caption>
<p>The long tail of biology. Data quantity, digitization, and openness can be described using a hyperbolic (hollow) curve with a small number of providers providing large quantities of data, and a large number of individuals providing small quantities of data.</p>
</caption>
<graphic xlink:href="ABI2012-391574.001"></graphic>
</fig>
<fig id="fig2" position="float">
<label>Figure 2</label>
<caption>
<p>A reference system architecture for an example IE system. Numbers correspond to the text.</p>
</caption>
<graphic xlink:href="ABI2012-391574.002"></graphic>
</fig>
<fig id="fig3" position="float">
<label>Figure 3</label>
<caption>
<p>An example of shallow parsing. Words and a sentence (S) are recognized. Then, the sentence is parsed into noun phrases (NP), verbs (V), and verb phrases (VP).</p>
</caption>
<graphic xlink:href="ABI2012-391574.003"></graphic>
</fig>
<fig id="fig4" position="float">
<label>Figure 4</label>
<caption>
<p>Shallow versus deep parsing. The shallow parsing result was produced by GENIA Tagger (
<ext-link ext-link-type="uri" xlink:href="http://text0.mib.man.ac.uk/software/geniatagger/">http://text0.mib.man.ac.uk/software/geniatagger/</ext-link>
). The deep parsing result was produced by the Enju Parser for Biomedical Domain (
<ext-link ext-link-type="uri" xlink:href="http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html">http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html</ext-link>
). GENIA Tagger and Enju Parser are products of the Tsujii Laboratory of the University of Tokyo and are optimized for the biomedical domain. Both parsing results contain errors; for example, “obovate” should be an ADJP (adjective phrase), but GENIA Tagger chunked it as a VP (verb phrase), and “blade” is a noun, but the Enju Parser tagged it as a verb (VBD). This is not to criticize the tools, but to point out that language differences between domains can have a significant impact on the performance of NLP tools. Parsers trained for a general domain produce erroneous results on morphological descriptions [
<xref ref-type="bibr" rid="B28">16</xref>
].</p>
</caption>
<graphic xlink:href="ABI2012-391574.004"></graphic>
</fig>
<fig id="fig5" position="float">
<label>Figure 5</label>
<caption>
<p>Extraction result from a descriptive sentence.</p>
</caption>
<graphic xlink:href="ABI2012-391574.005"></graphic>
</fig>
<fig id="figbox1" position="float">
<label>Box 1</label>
<graphic xlink:href="ABI2012-391574.006"></graphic>
</fig>
<fig id="figbox2" position="float">
<label>Box 2</label>
<graphic xlink:href="ABI2012-391574.007"></graphic>
</fig>
<fig id="figbox3" position="float">
<label>Box 3</label>
<graphic xlink:href="ABI2012-391574.008"></graphic>
</fig>
<fig id="figbox4" position="float">
<label>Box 4</label>
<graphic xlink:href="ABI2012-391574.009"></graphic>
</fig>
<table-wrap id="tab1" position="float">
<label>Table 1</label>
<caption>
<p>From Tang and Heidorn [
<xref ref-type="bibr" rid="B13">13</xref>
]. An example template for morphological character extraction. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Template slots</th>
<th align="center" rowspan="1" colspan="1">Extracted information</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Genus</td>
<td align="center" rowspan="1" colspan="1">Pellaea</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Species</td>
<td align="center" rowspan="1" colspan="1">mucronata</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Distribution</td>
<td align="center" rowspan="1" colspan="1">Nev. Calif.</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Leaf shape</td>
<td align="center" rowspan="1" colspan="1">ovate-deltate</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Leaf margin</td>
<td align="center" rowspan="1" colspan="1">dentate</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Leaf apex</td>
<td align="center" rowspan="1" colspan="1">mucronate</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Leaf base</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Leaf arrangement</td>
<td align="center" rowspan="1" colspan="1">clustered</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Blade dimension</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Leaf color</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Fruit/nut shape</td>
<td align="center" rowspan="1" colspan="1"></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="tab2" position="float">
<label>Table 2</label>
<caption>
<p>Information extraction tasks outlined by the MUCs and their descriptions.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Task</th>
<th align="center" rowspan="1" colspan="1">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Named entity</td>
<td align="center" rowspan="1" colspan="1">Extracts names of entities</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Coreference</td>
<td align="center" rowspan="1" colspan="1">Links references to the same entity</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Template element</td>
<td align="center" rowspan="1" colspan="1">Extracts descriptors of entities</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Template relation</td>
<td align="center" rowspan="1" colspan="1">Extracts relationships between entities</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Scenario template</td>
<td align="center" rowspan="1" colspan="1">Extracts events</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="tab3" position="float">
<label>Table 3</label>
<caption>
<p>Existing IE systems for biology [
<xref ref-type="bibr" rid="B70">17</xref>–
<xref ref-type="bibr" rid="B78">26</xref>
]. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">System</th>
<th align="left" rowspan="1" colspan="1">Approach</th>
<th align="left" rowspan="1" colspan="1">Structure of Text</th>
<th align="left" rowspan="1" colspan="1">Knowledge input</th>
<th align="left" rowspan="1" colspan="1">Application domain</th>
<th align="left" rowspan="1" colspan="1">Reference</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">AkanePPI</td>
<td align="left" rowspan="1" colspan="1">shallow parsing</td>
<td align="left" rowspan="1" colspan="1">sentence-split, tokenized, and annotated</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">protein interactions</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B70">17</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">EMPathIE</td>
<td align="left" rowspan="1" colspan="1">pattern matching</td>
<td align="left" rowspan="1" colspan="1">text</td>
<td align="left" rowspan="1" colspan="1">EMP database</td>
<td align="left" rowspan="1" colspan="1">enzymes</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B71">18</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">PASTA</td>
<td align="left" rowspan="1" colspan="1">pattern matching</td>
<td align="left" rowspan="1" colspan="1">text</td>
<td align="left" rowspan="1" colspan="1">biological lexicons</td>
<td align="left" rowspan="1" colspan="1">protein structure</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B72">19</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">BioIE</td>
<td align="left" rowspan="1" colspan="1">pattern matching</td>
<td align="left" rowspan="1" colspan="1">xml</td>
<td align="left" rowspan="1" colspan="1">dictionary of terms</td>
<td align="left" rowspan="1" colspan="1">biomedicine</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B73">20</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">BioRAT</td>
<td align="left" rowspan="1" colspan="1">pattern matching, sub-language driven</td>
<td align="left" rowspan="1" colspan="1">XML, HTML, text, or ASN.1; full-length PDF papers are converted to text</td>
<td align="left" rowspan="1" colspan="1">dictionary for protein and gene names, dictionary for interactions, and synonyms; text pattern template</td>
<td align="left" rowspan="1" colspan="1">biomedicine</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B74">21</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Chilibot</td>
<td align="left" rowspan="1" colspan="1">shallow parsing</td>
<td align="left" rowspan="1" colspan="1">not stated in the paper; could be XML, HTML, text, or ASN.1</td>
<td align="left" rowspan="1" colspan="1">nomenclature dictionary</td>
<td align="left" rowspan="1" colspan="1">biomedicine</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B75">22</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Dragon Toolkit</td>
<td align="left" rowspan="1" colspan="1">mixed syntactic semantic</td>
<td align="left" rowspan="1" colspan="1">text</td>
<td align="left" rowspan="1" colspan="1">domain ontologies</td>
<td align="left" rowspan="1" colspan="1">genomics</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B79">23</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">EBIMed</td>
<td align="left" rowspan="1" colspan="1">pattern matching</td>
<td align="left" rowspan="1" colspan="1">xml</td>
<td align="left" rowspan="1" colspan="1">dictionary of terms</td>
<td align="left" rowspan="1" colspan="1">biomedicine</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B76">24</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">iProLINK</td>
<td align="left" rowspan="1" colspan="1">shallow parsing</td>
<td align="left" rowspan="1" colspan="1">text</td>
<td align="left" rowspan="1" colspan="1">protein name dictionary, ontology, and annotated corpora</td>
<td align="left" rowspan="1" colspan="1">proteins</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B77">25</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">LitMiner</td>
<td align="left" rowspan="1" colspan="1">mixed syntactic semantic</td>
<td align="left" rowspan="1" colspan="1">web documents</td>
<td align="left" rowspan="1" colspan="1"></td>
<td align="left" rowspan="1" colspan="1">Drosophila research</td>
<td align="left" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B78">26</xref>
]</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="tab4" position="float">
<label>Table 4</label>
<caption>
<p>Performance metrics for the name recognition and morphological character extraction algorithms reviewed. Recall and precision values may not be directly comparable between the different algorithms. NA: not available [
<xref ref-type="bibr" rid="B80">30</xref>
]. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Tool</th>
<th align="center" rowspan="1" colspan="1">Recall</th>
<th align="center" rowspan="1" colspan="1">Precision</th>
<th align="center" rowspan="1" colspan="1">Test Corpora</th>
<th align="center" rowspan="1" colspan="1">Reference</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">TaxonGrab</td>
<td align="center" rowspan="1" colspan="1">>94%</td>
<td align="center" rowspan="1" colspan="1">>96%</td>
<td align="center" rowspan="1" colspan="1">Vol. 1 Birds of the Belgian Congo by Chapin</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B44">31</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">FAT</td>
<td align="center" rowspan="1" colspan="1">40.2%</td>
<td align="center" rowspan="1" colspan="1">84.0%</td>
<td align="center" rowspan="1" colspan="1">American Seashells by Abbott</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B27">32</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Taxon Finder</td>
<td align="center" rowspan="1" colspan="1">54.3%</td>
<td align="center" rowspan="1" colspan="1">97.5%</td>
<td align="center" rowspan="1" colspan="1">American Seashells by Abbott</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B27">32</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Neti Neti</td>
<td align="center" rowspan="1" colspan="1">70.5%</td>
<td align="center" rowspan="1" colspan="1">98.9%</td>
<td align="center" rowspan="1" colspan="1">American Seashells by Abbott</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B27">32</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">LINNAEUS</td>
<td align="center" rowspan="1" colspan="1">94.3%</td>
<td align="center" rowspan="1" colspan="1">97.1%</td>
<td align="center" rowspan="1" colspan="1">LINNAEUS gold standard data set</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B47">33</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Organism Tagger</td>
<td align="center" rowspan="1" colspan="1">94.0%</td>
<td align="center" rowspan="1" colspan="1">95.0%</td>
<td align="center" rowspan="1" colspan="1">LINNAEUS gold standard data set</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B49">34</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">X-tract</td>
<td align="center" rowspan="1" colspan="1">NA</td>
<td align="center" rowspan="1" colspan="1">NA</td>
<td align="center" rowspan="1" colspan="1">Flora of North America</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B61">35</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Worldwide Botanical Knowledge Base</td>
<td align="center" rowspan="1" colspan="1">NA</td>
<td align="center" rowspan="1" colspan="1">NA</td>
<td align="center" rowspan="1" colspan="1">Flora of China</td>
<td align="center" rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="http://wwbota.free.fr/">http://wwbota.free.fr/</ext-link>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Terminator</td>
<td align="center" rowspan="1" colspan="1">NA</td>
<td align="center" rowspan="1" colspan="1">NA</td>
<td align="center" rowspan="1" colspan="1">16 nematode descriptions</td>
<td align="center" rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="http://www.math.ucdavis.edu/~milton/genisys/terminator.html">http://www.math.ucdavis.edu/~milton/genisys/terminator.html</ext-link>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MultiFlora</td>
<td align="center" rowspan="1" colspan="1">mid 60%</td>
<td align="center" rowspan="1" colspan="1">mid 70%</td>
<td align="center" rowspan="1" colspan="1">Descriptions of Ranunculus spp. from six Floras</td>
<td align="center" rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="http://intranet.cs.man.ac.uk/ai/public/MultiFlora/MF1.html">http://intranet.cs.man.ac.uk/ai/public/MultiFlora/MF1.html</ext-link>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">MARTT</td>
<td align="center" rowspan="1" colspan="1">98.0%</td>
<td align="center" rowspan="1" colspan="1">58.0%</td>
<td align="center" rowspan="1" colspan="1">Flora of North America and Flora of China</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B80">30</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">WHISK</td>
<td align="center" rowspan="1" colspan="1">33.33% to 79.65%</td>
<td align="center" rowspan="1" colspan="1">72.52% to 100%</td>
<td align="center" rowspan="1" colspan="1">Flora of North America</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B13">13</xref>
]</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">CharaParser</td>
<td align="center" rowspan="1" colspan="1">90.0%</td>
<td align="center" rowspan="1" colspan="1">91.0%</td>
<td align="center" rowspan="1" colspan="1">Flora of North America</td>
<td align="center" rowspan="1" colspan="1">[
<xref ref-type="bibr" rid="B68">36</xref>
]</td>
</tr>
</tbody>
</table>
</table-wrap>
</floats-group>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000210 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000210 | SxmlIndent | more
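Once the record has been retrieved with the commands above, its tables can be read with any general-purpose XML parser. The following is a minimal Python sketch and is not part of Dilib: the function name `table_rows` is hypothetical, and the inline sample is an assumption that mirrors only a few rows of Table 4. A real record would also need its namespace prefixes (such as `xlink`) to be declared before the standard-library parser accepts it.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: a cut-down copy of Table 4 from the record.
SAMPLE = """<floats-group>
<table-wrap id="tab4" position="float">
<label>Table 4</label>
<table>
<thead><tr><th>Tool</th><th>Recall</th><th>Precision</th></tr></thead>
<tbody>
<tr><td>Neti Neti</td><td>70.5%</td><td>98.9%</td></tr>
<tr><td>CharaParser</td><td>90.0%</td><td>91.0%</td></tr>
<tr><td>X-tract</td><td>NA</td><td>NA</td></tr>
</tbody>
</table>
</table-wrap>
</floats-group>"""


def table_rows(xml_text, table_id):
    """Return the rows of the table-wrap with the given id as lists of cell text."""
    root = ET.fromstring(xml_text)
    for wrap in root.iter("table-wrap"):
        if wrap.get("id") == table_id:
            # Each <tr> contributes one row; whitespace inside cells is collapsed.
            return [
                [" ".join("".join(cell.itertext()).split()) for cell in row]
                for row in wrap.iter("tr")
            ]
    return []  # no table with that id


rows = table_rows(SAMPLE, "tab4")
for row in rows:
    print("\t".join(row))
```

The header row comes back first because `iter("tr")` walks the table in document order, so callers that want data rows only can skip the first element.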

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:3364545
   |texte=   Applications of Natural Language Processing in Biodiversity Science
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:22685456" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024