OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases

Internal identifier: 000083 (Main/Merge); previous: 000082; next: 000084

Author: Tony Rees

Source: PLoS ONE (2014)

RBID : PMC:4172526

Abstract

Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms n-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.
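
The abstract names a custom Modified Damerau-Levenshtein Distance but does not describe the modification here. As a rough illustration only, the sketch below implements the standard restricted Damerau-Levenshtein (optimal string alignment) distance, which counts insertions, deletions, substitutions and adjacent transpositions. It is not Taxamatch's algorithm; the function name `osa_distance` and the sample names are assumptions made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative only: this is the textbook variant, not the modified
    version used by Taxamatch.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    # Hypothetical misspelling with two adjacent letters swapped.
    print(osa_distance("sapiens", "sapeins"))
```

A plain Levenshtein distance would score the swapped pair above as two edits; treating a transposition as one edit is what the Damerau extension adds, which matters for typing errors of the kind the abstract describes.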


Url:
DOI: 10.1371/journal.pone.0107510
PubMed: 25247892
PubMed Central: 4172526

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases</title>
<author>
<name sortKey="Rees, Tony" sort="Rees, Tony" uniqKey="Rees T" first="Tony" last="Rees">Tony Rees</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25247892</idno>
<idno type="pmc">4172526</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4172526</idno>
<idno type="RBID">PMC:4172526</idno>
<idno type="doi">10.1371/journal.pone.0107510</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000181</idno>
<idno type="wicri:Area/Pmc/Curation">000181</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000036</idno>
<idno type="wicri:Area/Ncbi/Merge">000210</idno>
<idno type="wicri:Area/Ncbi/Curation">000210</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000210</idno>
<idno type="wicri:Area/Main/Merge">000083</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases</title>
<author>
<name sortKey="Rees, Tony" sort="Rees, Tony" uniqKey="Rees T" first="Tony" last="Rees">Tony Rees</name>
<affiliation>
<nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms <italic>n</italic>-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.</p>
</div>
</front>
</TEI>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000083 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000083 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     PMC:4172526
   |texte=   Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Merge/RBID.i   -Sk "pubmed:25247892" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Merge/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024