Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases
Internal identifier: 000210 (Ncbi/Checkpoint); previous: 000209; next: 000211
Authors: Tony Rees
Source:
- PLoS ONE [ 1932-6203 ] ; 2014.
Abstract
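Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms n-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.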
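As a rough illustration of the two baseline techniques the abstract names (edit distance with transpositions, and n-gram comparison), the sketch below implements a textbook restricted Damerau-Levenshtein distance and a bigram Dice similarity. It is not Taxamatch's Modified Damerau-Levenshtein Distance, its phonetic component, or its heuristic filters; the function names `osa_distance` and `bigram_dice` and the misspelled example name are assumptions made here for illustration only.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent characters. Textbook version, NOT the modified variant
    described in the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams of a and b."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have bigrams count as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    # "Homo sapeins" is a made-up misspelling used only as an example.
    print(osa_distance("Homo sapiens", "Homo sapeins"))
    print(bigram_dice("Homo sapiens", "Homo sapeins"))
```

With the transposition step, the swapped "ie" in the example counts as a single edit, whereas plain Levenshtein distance would count two substitutions. That difference is one reason transposition-aware distances suit typing errors in names.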
Url:
DOI: 10.1371/journal.pone.0107510
PubMed: 25247892
PubMed Central: 4172526
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 000181
- to stream Pmc, to step Curation: 000181
- to stream Pmc, to step Checkpoint: 000036
- to stream Ncbi, to step Merge: 000210
- to stream Ncbi, to step Curation: 000210
Links to Exploration step
PMC:4172526
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases</title>
<author><name sortKey="Rees, Tony" sort="Rees, Tony" uniqKey="Rees T" first="Tony" last="Rees">Tony Rees</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">25247892</idno>
<idno type="pmc">4172526</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4172526</idno>
<idno type="RBID">PMC:4172526</idno>
<idno type="doi">10.1371/journal.pone.0107510</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000181</idno>
<idno type="wicri:Area/Pmc/Curation">000181</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000036</idno>
<idno type="wicri:Area/Ncbi/Merge">000210</idno>
<idno type="wicri:Area/Ncbi/Curation">000210</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000210</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases</title>
<author><name sortKey="Rees, Tony" sort="Rees, Tony" uniqKey="Rees T" first="Tony" last="Rees">Tony Rees</name>
<affiliation><nlm:aff id="aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint><date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>Misspellings of organism scientific names create barriers to optimal storage and organization of biological data, reconciliation of data stored under different spelling variants of the same name, and appropriate responses from user queries to taxonomic data systems. This study presents an analysis of the nature of the problem from first principles, reviews some available algorithmic approaches, and describes Taxamatch, an improved name matching solution for this information domain. Taxamatch employs a custom Modified Damerau-Levenshtein Distance algorithm in tandem with a phonetic algorithm, together with a rule-based approach incorporating a suite of heuristic filters, to produce improved levels of recall, precision and execution time over the existing dynamic programming algorithms <italic>n</italic>-grams (as bigrams and trigrams) and standard edit distance. Although entirely phonetic methods are faster than Taxamatch, they are inferior in the area of recall since many real-world errors are non-phonetic in nature. Excellent performance of Taxamatch (as recall, precision and execution time) is demonstrated against a reference database of over 465,000 genus names and 1.6 million species names, as well as against a range of error types as present at both genus and species levels in three sets of sample data for species and four for genera alone. An ancillary authority matching component is included which can be used both for misspelled names and for otherwise matching names where the associated cited authorities are not identical.</p>
</div>
</front>
</TEI>
<affiliations><list></list>
<tree><noCountry><name sortKey="Rees, Tony" sort="Rees, Tony" uniqKey="Rees T" first="Tony" last="Rees">Tony Rees</name>
</noCountry>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Ncbi/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000210 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Checkpoint/biblio.hfd -nk 000210 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Ncbi |étape= Checkpoint |type= RBID |clé= PMC:4172526 |texte= Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Checkpoint/RBID.i -Sk "pubmed:25247892" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Checkpoint/biblio.hfd \
       | NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.