Cyberinfrastructure exploration server


Internal identifier: 000216 (Pmc/Corpus); previous: 0002159; next: 0002170


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Improved data retrieval from TreeBASE via taxonomic and linguistic data enrichment</title>
<author>
<name sortKey="Anwar, Nadia" sort="Anwar, Nadia" uniqKey="Anwar N" first="Nadia" last="Anwar">Nadia Anwar</name>
<affiliation>
<nlm:aff id="I1">Faculty of Biomedical and Life Sciences, University of Glasgow, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hunt, Ela" sort="Hunt, Ela" uniqKey="Hunt E" first="Ela" last="Hunt">Ela Hunt</name>
<affiliation>
<nlm:aff id="I2">Computer and Information Sciences, University of Strathclyde, UK</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">19426482</idno>
<idno type="pmc">2685121</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2685121</idno>
<idno type="RBID">PMC:2685121</idno>
<idno type="doi">10.1186/1471-2148-9-93</idno>
<date when="2009">2009</date>
<idno type="wicri:Area/Pmc/Corpus">000216</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Improved data retrieval from TreeBASE via taxonomic and linguistic data enrichment</title>
<author>
<name sortKey="Anwar, Nadia" sort="Anwar, Nadia" uniqKey="Anwar N" first="Nadia" last="Anwar">Nadia Anwar</name>
<affiliation>
<nlm:aff id="I1">Faculty of Biomedical and Life Sciences, University of Glasgow, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hunt, Ela" sort="Hunt, Ela" uniqKey="Hunt E" first="Ela" last="Hunt">Ela Hunt</name>
<affiliation>
<nlm:aff id="I2">Computer and Information Sciences, University of Strathclyde, UK</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Evolutionary Biology</title>
<idno type="eISSN">1471-2148</idno>
<imprint>
<date when="2009">2009</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>TreeBASE, the only data repository for phylogenetic studies, is not being used effectively since it does not meet the taxonomic data retrieval requirements of the systematics community. We show, through an examination of the queries performed on TreeBASE, that data retrieval using taxon names is unsatisfactory.</p>
</sec>
<sec>
<title>Results</title>
<p>We report on a new wrapper supporting taxon queries on TreeBASE by utilising a Taxonomy and Classification Database (TCl-Db) we created. TCl-Db holds merged and consolidated taxonomic names from multiple data sources and can be used to translate hierarchical, vernacular and synonym queries into specific query terms in TreeBASE. The query expansion supported by TCl-Db yields a very significant improvement in information retrieval quality. The wrapper can be accessed at the URL
<ext-link ext-link-type="uri" xlink:href="http://spira.zoology.gla.ac.uk/app/tbasewrapper.php"></ext-link>
</p>
<p>The methodology we developed is scalable and can be applied to new data as they become available in the future.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>Significantly improved data retrieval quality is shown for all queries, and additional flexibility is achieved via user-driven taxonomy selection.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Evol Biol</journal-id>
<journal-title>BMC Evolutionary Biology</journal-title>
<issn pub-type="epub">1471-2148</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">19426482</article-id>
<article-id pub-id-type="pmc">2685121</article-id>
<article-id pub-id-type="publisher-id">1471-2148-9-93</article-id>
<article-id pub-id-type="doi">10.1186/1471-2148-9-93</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Database</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Improved data retrieval from TreeBASE via taxonomic and linguistic data enrichment</article-title>
</title-group>
<contrib-group>
<contrib id="A1" corresp="yes" contrib-type="author">
<name>
<surname>Anwar</surname>
<given-names>Nadia</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>n.anwar@udcf.gla.ac.uk</email>
</contrib>
<contrib id="A2" contrib-type="author">
<name>
<surname>Hunt</surname>
<given-names>Ela</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>ela.hunt@cis.strath.ac.uk</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Faculty of Biomedical and Life Sciences, University of Glasgow, UK</aff>
<aff id="I2">
<label>2</label>
Computer and Information Sciences, University of Strathclyde, UK</aff>
<pub-date pub-type="collection">
<year>2009</year>
</pub-date>
<pub-date pub-type="epub">
<day>8</day>
<month>5</month>
<year>2009</year>
</pub-date>
<volume>9</volume>
<fpage>93</fpage>
<lpage>93</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2148/9/93"></ext-link>
<history>
<date date-type="received">
<day>28</day>
<month>5</month>
<year>2008</year>
</date>
<date date-type="accepted">
<day>8</day>
<month>5</month>
<year>2009</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2009 Anwar and Hunt; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2009</copyright-year>
<copyright-holder>Anwar and Hunt; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0"></ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</p>
<pmc-comment>Anwar, Nadia (n.anwar@udcf.gla.ac.uk). Improved data retrieval from TreeBASE via taxonomic and linguistic data enrichment. BMC Evolutionary Biology 9(1): 93 (2009). urn:ISSN:1471-2148</pmc-comment>
</license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>TreeBASE, the only data repository for phylogenetic studies, is not being used effectively since it does not meet the taxonomic data retrieval requirements of the systematics community. We show, through an examination of the queries performed on TreeBASE, that data retrieval using taxon names is unsatisfactory.</p>
</sec>
<sec>
<title>Results</title>
<p>We report on a new wrapper supporting taxon queries on TreeBASE by utilising a Taxonomy and Classification Database (TCl-Db) we created. TCl-Db holds merged and consolidated taxonomic names from multiple data sources and can be used to translate hierarchical, vernacular and synonym queries into specific query terms in TreeBASE. The query expansion supported by TCl-Db yields a very significant improvement in information retrieval quality. The wrapper can be accessed at the URL
<ext-link ext-link-type="uri" xlink:href="http://spira.zoology.gla.ac.uk/app/tbasewrapper.php"></ext-link>
</p>
<p>The methodology we developed is scalable and can be applied to new data as they become available in the future.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>Significantly improved data retrieval quality is shown for all queries, and additional flexibility is achieved via user-driven taxonomy selection.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>Systematics aims to increase our understanding of biological diversity by identifying and classifying organisms and using phylogenies to understand the relationships between organisms. The field has developed very elaborate and sophisticated tools for phylogeny construction, and practitioners have been very active in building new, better and faster algorithms [
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B2">2</xref>
]. However, this has not been matched by database development for long-term access to and storage of the phylogenies produced by these algorithms. Although much of the data used in phylogenetic analysis is acquired from databases in other fields, particularly specimen data from museum collections [
<xref ref-type="bibr" rid="B3">3</xref>
] and sequence data [
<xref ref-type="bibr" rid="B2">2</xref>
] such as those available at NCBI [
<xref ref-type="bibr" rid="B4">4</xref>
], the results of phylogenetic analysis are not as easily accessible. Mostly, phylogenetic data are retrieved through literature searches and remain buried in the pages and supplementary material sections of the journals in which they are published. This inaccessibility makes the data impractical to use and limits the potential for information reuse. Projects such as the Tree of Life [
<xref ref-type="bibr" rid="B5">5</xref>
]
<ext-link ext-link-type="uri" xlink:href="http://www.tolweb.org/tree"></ext-link>
face significant data accessibility issues.</p>
<p>The Tree of Life aims to build a complete phylogenetic tree of the world's biodiversity, and ultimately to describe the history of life on earth. The informatics requirements are considerable, as the available data collections grow in size and complexity. Confronting the information explosion requires creative new approaches to facilitating the use of that information. Finding information in complex data sets becomes increasingly difficult as the data grow; data search and discovery therefore need to be timely, intuitive and precise. Data retrieval through meaningful queries [
<xref ref-type="bibr" rid="B6">6</xref>
] is paramount to the successful fulfilment of the ever more sophisticated data requirements of the systematics community. A phylogenetic data repository [
<xref ref-type="bibr" rid="B7">7</xref>
] should have a good understanding of the organisms that are represented in the phylogenetic trees and support searches using species and higher taxa names. However, currently this is not the case. TreeBASE [
<xref ref-type="bibr" rid="B8">8</xref>
] is currently the only repository for phylogenetic analyses. Here we show that data retrieval using taxonomic names as query terms is inadequate.</p>
<p>In the GenBank
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/Genbank"></ext-link>
sequence database, which contains the NCBI taxonomy, a query can be performed to retrieve all insect sequences or all
<italic>Drosophila </italic>
sequences. TreeBASE, however, does not contain a taxonomy and queries selecting all
<italic>Drosophila </italic>
studies or phylogenetic trees for insects are not easily specified. The inclusion of a taxonomic infrastructure within TreeBASE is essential to support such queries.</p>
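<p>As a minimal sketch of why a stored classification makes such queries possible, the Python fragment below keeps a toy taxonomy as a child-to-parent table and filters records by ancestry. The table contents, the record identifiers and the function names are illustrative assumptions; they do not reproduce the NCBI taxonomy or the GenBank interface.</p>
<preformat>
```python
# Toy classification: each name points to its parent (None marks the root).
# The entries are illustrative only, not an extract of the NCBI taxonomy.
PARENT = {
    "Insecta": None,
    "Diptera": "Insecta",
    "Drosophila": "Diptera",
    "Drosophila melanogaster": "Drosophila",
    "Drosophila simulans": "Drosophila",
    "Apis": "Insecta",
    "Apis mellifera": "Apis",
}


def is_within(name, higher_taxon, parent=PARENT):
    """Return True if `name` equals `higher_taxon` or descends from it."""
    current = name
    while current is not None:
        if current == higher_taxon:
            return True
        # Unknown names have no parent entry, which ends the walk.
        current = parent.get(current)
    return False


def select_records(records, higher_taxon):
    """Keep the identifiers of records whose taxon falls under `higher_taxon`."""
    return [rec_id for rec_id, taxon in records if is_within(taxon, higher_taxon)]


# Hypothetical (record identifier, taxon name) pairs.
RECORDS = [
    ("rec1", "Drosophila melanogaster"),
    ("rec2", "Apis mellifera"),
    ("rec3", "Drosophila simulans"),
]

print(select_records(RECORDS, "Drosophila"))  # ['rec1', 'rec3']
print(select_records(RECORDS, "Insecta"))     # ['rec1', 'rec2', 'rec3']
```
</preformat>
<p>Without the parent table, only records labelled with the literal query term could be found, which is the situation described for TreeBASE.</p>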
<p>To address the problem of TreeBASE querying, we designed a taxonomic data warehouse combining taxonomic names and classification data that can be
<italic>superimposed </italic>
on TreeBASE to enable hierarchical and linguistic query expansion. Our hypothesis was that data integration in a warehouse would also provide breadth of coverage for taxon names by combining data from multiple sources.</p>
<p>The rest of this paper is structured as follows. The next section provides background on taxonomy and its uses in systematics. An outline of the user requirements and a description of TCl-Db, the data warehouse built as a taxonomic infrastructure for TreeBASE, and the methods of query expansion are then given. Finally, we show retrieval problems experienced by TreeBASE users through an analysis of the query logs from TreeBASE. We conclude that data retrieval difficulties are in part due to the lack of taxonomic intelligence in TreeBASE, and we demonstrate improved data retrieval based on the use of TCl-Db and the software infrastructure we created, as compared to results delivered by Phylofinder [
<xref ref-type="bibr" rid="B9">9</xref>
].</p>
<sec>
<title>Taxonomy</title>
<p>Taxonomic data are produced by the processes of
<italic>Naming</italic>
, which involves attaching a label to a concept for the purposes of communication, and
<italic>Classification</italic>
, that is arranging similar concepts together for the purpose of organisation. The name provides a handle on the biological organism and the position in the classification provides knowledge of the organism in terms of its similarity to others [
<xref ref-type="bibr" rid="B10">10</xref>
]. This section gives a brief overview of the difficulties users experience when utilising taxonomic data.</p>
<p>The taxonomic classification system is an information storage and retrieval system [
<xref ref-type="bibr" rid="B11">11</xref>
], originally designed to be easily memorised [
<xref ref-type="bibr" rid="B12">12</xref>
]. Taxon names serve two roles: the name represents an organism that was described and named by a taxonomist, and the name is also placed in a hierarchy to relate the organism to the tree of life. This duality presents difficulties in the use of taxonomic names. The interdependence between the name and the classification, the fact that names are not necessarily unique to one organism, and the fact that the placement of an organism's name in the hierarchy is not fixed all complicate the use of taxonomic names for information storage and retrieval. Compounding this is the distributed nature of the data. The taxonomy field uses over 200 information systems
<ext-link ext-link-type="uri" xlink:href="http://data.gbif.org/datasets/"></ext-link>
. This number will continue to grow as herbaria and museums digitise their collections [
<xref ref-type="bibr" rid="B13">13</xref>
] and make their data accessible on the web. Although taxonomy has firmly taken its place as a digital science, data accessibility continues to cause difficulty: in addition to the distribution of the data, there is their heterogeneity and the lack of one all-encompassing taxonomic reference. Given that the amount of data is growing and the data are in constant flux, it is unlikely that it will be possible to agree on a 'unitary taxonomy' [
<xref ref-type="bibr" rid="B14">14</xref>
]. However, a single all-encompassing data portal is achievable [
<xref ref-type="bibr" rid="B15">15</xref>
], and this challenge is being addressed by GBIF [
<xref ref-type="bibr" rid="B16">16</xref>
] and projects such as the Encyclopaedia of Life [
<xref ref-type="bibr" rid="B17">17</xref>
].</p>
<p>Most taxonomic data systems were developed to meet particular requirements in their use or data scope. Taxonomic data is, by its nature, distributed. The data produced from taxonomic research tends to follow a particular focus, a group such as insects or birds, or a geographical location, or a period in history. There is significant heterogeneity in the data models and storage formats of the databases and the interfaces provided to access the data. The taxonomic community have established the Taxonomic Databases Working Group (TDWG) to address data standards, data integration and interoperability. This effort is beginning to alleviate some of the accessibility and interoperability problems experienced by users [
<xref ref-type="bibr" rid="B18">18</xref>
]. Taxonomic data are also not easily deployed outside the systems in which they are stored. This is due to the nature of taxonomic names. As stated in [
<xref ref-type="bibr" rid="B19">19</xref>
], taxa are not facts like the data in most other databases; instead, taxa are hypotheses which are "proposed, used, modified, and then perhaps discarded, as evidence dictates". The classification of an organism is based on a set of criteria selected by the expert taxonomist. Not only do these criteria change (for example, sequence versus morphology [
<xref ref-type="bibr" rid="B20">20</xref>
]), but different criteria are also used by different taxonomists (different morphological characteristics can be given different weights).</p>
<p>Additional complications arise from the addition of new data as new organisms are discovered, and taxonomic revisions that are made to update existing groups. There can be, at any one time, more than one accepted taxonomic opinion on the name and classification of an organism. This complicates the use of taxon names as search terms, as the meaning of the names can change. For example, in situations where a name has changed for taxonomic reasons, such as
<italic>Diomedea albatrus </italic>
which was changed to
<italic>Phoebastria albatrus </italic>
[
<xref ref-type="bibr" rid="B21">21</xref>
], additional support is needed to recognise that relevant data may be attached to both of these terms. When the user performs a search on
<italic>Phoebastria albatrus</italic>
, should any data associated with
<italic>Diomedea albatrus </italic>
also be returned? Similarly, when a user performs a search on a vernacular term 'short-tailed albatross', is it assumed that the system should translate this term to the appropriate Latin names, i.e.
<italic>Phoebastria albatrus </italic>
and
<italic>Diomedea albatrus</italic>
? Also, when a search is performed on the term Aves, we need to know whether the user requires the NCBI meaning of the term or the ITIS [
<xref ref-type="bibr" rid="B22">22</xref>
] meaning of the term. It is not surprising that at the time of development the TreeBASE developers shelved these taxonomic issues. It is now timely and important to address the taxonomic requirements of TreeBASE, given that the system is in the process of being overhauled by the CIPRES project [
<xref ref-type="bibr" rid="B23">23</xref>
].</p>
<p>CIPRES (CyberInfrastructure for Phylogenetic RESearch) have taken over responsibility for TreeBASE and, as part of their database research programme, plan to overhaul the database to enable more complex queries than those currently available in TreeBASE. The new version of TreeBASE is named TreeBASE2, and the published Entity-Relationship model contains a taxon module from which it appears that the taxonomic data will be curated from external data sources. However, the documentation does not suggest that hierarchical queries will be directly supported by the TreeBASE2 schema. In addition to TreeBASE2, the CIPRES project have two other research programmes: algorithms for phylogenetic reconstruction and visualisation; and a modelling programme that aims to build mathematical models that can be used to test phylogenetic reconstructions. The project aims to build a complete infrastructure of data and algorithms for the systematics community.</p>
</sec>
<sec>
<title>Systematics</title>
<p>Like taxonomists, most systematists focus their research on a particular group. For these scientists the taxonomic requirements are fairly manageable, and usually involve the most up-to-date checklists. Most scientists are adept at keeping up-to-date with the literature in their area and for the most part they produce their own data. Some systematics studies, however, go beyond the usual boundaries of collecting data and building trees. Two examples are cospeciation analysis [
<xref ref-type="bibr" rid="B24">24</xref>
] and the study of species richness [
<xref ref-type="bibr" rid="B25">25</xref>
]. A cospeciation study usually follows two taxonomic schemes: one for the host species, and one for the parasites. Parasites are of particular interest in systematics because of the shared history of the host and the parasite [
<xref ref-type="bibr" rid="B24">24</xref>
,
<xref ref-type="bibr" rid="B26">26</xref>
]. The analysis involves comparing the phylogenies of the parasite and the host. These phylogenies either need to be collected from the literature or built from morphological or sequence data. For the data that are collected, literature searches are normally conducted using the species or higher taxa names as the search terms. Similarly, a study of the parasite species richness of a group of organisms also uses two taxonomic schemes and involves collecting data using taxon names as search terms [
<xref ref-type="bibr" rid="B27">27</xref>
]. These examples show that more studies now require data gathering, not only to stay up-to-date with previously published work but also to collect data for further analysis. Another example, where collecting data is integral to the study, is in building super trees [
<xref ref-type="bibr" rid="B28">28</xref>
,
<xref ref-type="bibr" rid="B29">29</xref>
].</p>
<p>Within super tree analyses, data from several studies are gathered using taxon names as search terms. Once these data are collected, the taxonomic names across these data need to be synonymised. Usually, this is done through one authoritative source, for example, Beck
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B30">30</xref>
] used Mammal Species of the World [
<xref ref-type="bibr" rid="B31">31</xref>
]; and Thomas
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B32">32</xref>
] used the taxonomy of Sibley and Monroe [
<xref ref-type="bibr" rid="B33">33</xref>
]. Where one such data source exists, this is a simple task; however, the time is approaching when super trees go beyond the use of one taxonomic source [
<xref ref-type="bibr" rid="B5">5</xref>
].</p>
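<p>The synonymisation step described above can be sketched as a lookup against a single authority list. The fragment below is a minimal illustration under stated assumptions: the authority table, the study identifiers and the function name are hypothetical, and a real checklist would hold many thousands of entries.</p>
<preformat>
```python
# Hypothetical authority list mapping each known name to its accepted name.
ACCEPTED = {
    "Diomedea albatrus": "Phoebastria albatrus",
    "Phoebastria albatrus": "Phoebastria albatrus",
    "Puffinus gravis": "Puffinus gravis",
}


def synonymise(study_names, accepted=ACCEPTED):
    """Map the names used in several studies to accepted names.

    Returns (resolved, unresolved). `resolved` maps a study id to the set of
    accepted names; `unresolved` lists (study id, name) pairs that the
    authority does not cover, so that they can be checked by hand.
    """
    resolved = {}
    unresolved = []
    for study_id, names in study_names.items():
        resolved[study_id] = set()
        for name in names:
            key = " ".join(name.split())  # collapse stray whitespace
            if key in accepted:
                resolved[study_id].add(accepted[key])
            else:
                unresolved.append((study_id, name))
    return resolved, unresolved


# Two made-up studies that use different names for the same species.
studies = {
    "study_a": ["Diomedea albatrus", "Puffinus gravis"],
    "study_b": ["Phoebastria  albatrus", "Puffinus sp."],
}
resolved, unresolved = synonymise(studies)
print(resolved["study_a"].intersection(resolved["study_b"]))  # {'Phoebastria albatrus'}
print(unresolved)  # [('study_b', 'Puffinus sp.')]
```
</preformat>
<p>Names that fall outside the authority list are reported instead of being dropped, which is where a second taxonomic source would be needed.</p>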
<p>The main use of taxonomic data outside its immediate user community is in information retrieval, as the examples above show. Names are used as the keys to retrieve data [
<xref ref-type="bibr" rid="B34">34</xref>
-
<xref ref-type="bibr" rid="B36">36</xref>
]. Currently, no single taxonomic data provider supports the needs of the systematics community. Despite TreeBASE being the only repository for phylogenetic data, systematists prefer to gather the data they require for their analysis through literature searches. In most cases, once data are retrieved, the search results are examined by eye to determine if they contain the phylogenetic data of interest. Since TreeBASE does not provide a complete phylogenetic data resource, literature searches still have to be performed to ensure thoroughness. Unlike sequence data in the major sequence databases, phylogenetic tree data do not have to be deposited in a database before they can be published; deposition of data in TreeBASE is currently voluntary. Also, TreeBASE is not exploited fully because data are difficult to retrieve using search terms that are intuitive to users. Although TreeBASE provides a taxon name search, the returned data are often incomplete. Our hypothesis was that an integrated taxonomic data source could alleviate the problems of using taxonomic names to retrieve data from TreeBASE. Using taxon queries performed on
<ext-link ext-link-type="uri" xlink:href="http://www.treebase.org"></ext-link>
, we show a significant improvement in data retrieval when the same queries were expanded using TCl-Db tables linked to TreeBASE. The following sections describe the taxonomic requirements of TreeBASE, and follow on with a description of TCl-Db, the data warehouse that was developed to meet these needs.</p>
</sec>
<sec>
<title>Taxonomic Requirements of TreeBASE</title>
<p>TreeBASE [
<xref ref-type="bibr" rid="B8">8</xref>
] is a phylogenetic and evolutionary information store containing phylogenies for more than 100,000 taxa. Despite the intrinsic taxonomic content, at design, the developers of TreeBASE purposely excluded taxonomy [
<xref ref-type="bibr" rid="B8">8</xref>
]. The TreeBASE interface
<ext-link ext-link-type="uri" xlink:href="http://www.treebase.org"></ext-link>
supports six query types: author, citation, study accession number, matrix accession number, taxon and structure. The taxon search, however, does not perform adequately, as it does not effectively support higher taxa queries or synonym and vernacular queries.</p>
<p>From a biologist's perspective, the taxon search option does not return the expected results. The query term 'Aves' currently returns 5 studies (S281, S880, S296, S1166, S433). On closer inspection, there are many more studies containing Aves (birds) within TreeBASE, for example the search term
<italic>Gallus </italic>
returns a further 2 studies (S1522, S606) and
<italic>Diomedea </italic>
returns 1 more study (S351). Similarly, the search term
<italic>Puffinus </italic>
returns no studies; however, using the search terms
<italic>Puffinus tenuirostris </italic>
or
<italic>Puffinus gravis</italic>
, the study S714 in which they are located is returned. The species
<italic>Puffinus gravis </italic>
is also contained in the study S351; however, a search using the taxon name is not successful because the node in the tree is labelled 'Puffinus gravis U74354'. These examples show that higher taxa terms such as 'Aves' and
<italic>Puffinus </italic>
are not being expanded to include the scientific names they subsume. Queries performed on TreeBASE return only data where the search term matches
<italic>exactly </italic>
a term contained in the study. As such, the term 'birds', which is the vernacular associated with Aves, returns no data because it is not contained in any study. Similarly, the currently accepted valid name
<italic>Phoebastria albatrus</italic>
does not return the study S714, in which the species appears under its earlier name
<italic>Diomedea albatrus</italic>
. The taxonomic content and structure of TreeBASE do not support these queries, as query terms are not expanded to include associated terms and, as a result, only partial results are returned. The current data retrieval options within TreeBASE pose a problem for the research community, who commonly use taxonomic names as search terms. The research hypothesis studied in this paper is that data retrieval from TreeBASE can be improved by the inclusion of a taxonomic and linguistic infrastructure (a dictionary of synonyms and vernaculars).</p>
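<p>The difference between exact matching and a more tolerant comparison of tree labels can be illustrated with a small Python sketch. The label index below is a toy loosely based on the examples in this section, not an extract of TreeBASE, and the accession pattern and function names are assumptions made for illustration.</p>
<preformat>
```python
import re

# Toy index of tree labels per study (illustrative only).
STUDY_LABELS = {
    "S714": ["Puffinus tenuirostris", "Puffinus gravis", "Diomedea albatrus"],
    "S351": ["Puffinus gravis U74354"],
}

# Assumed shape of an accession-like suffix: one or two capital letters
# followed by five or six digits at the end of the label.
ACCESSION_SUFFIX = re.compile(r"\s+[A-Z]{1,2}\d{5,6}$")


def exact_search(term, index=STUDY_LABELS):
    """Studies holding a label identical to `term`."""
    return sorted(study for study, labels in index.items() if term in labels)


def clean_label(label):
    """Drop a trailing accession-like token such as 'U74354'."""
    return ACCESSION_SUFFIX.sub("", label).strip()


def tolerant_search(term, index=STUDY_LABELS):
    """Match cleaned labels exactly, or by first word for genus-only terms."""
    hits = set()
    for study, labels in index.items():
        for label in labels:
            name = clean_label(label)
            words = name.split()
            if name == term or (words and words[0] == term):
                hits.add(study)
    return sorted(hits)


print(exact_search("Puffinus gravis"))     # ['S714']
print(tolerant_search("Puffinus gravis"))  # ['S351', 'S714']
print(exact_search("Puffinus"))            # []
print(tolerant_search("Puffinus"))         # ['S351', 'S714']
```
</preformat>
<p>Label cleaning of this kind only addresses decorated labels and genus prefixes; vernacular terms and synonyms still need a dictionary.</p>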
<p>The taxonomic requirements that TreeBASE should support are: 1) search terms should expand to include subordinate terms in the classification if they are higher taxa, 2) vernacular queries should be supported and expand appropriately to include the data linked to the scientific names, and 3) any given query should also expand to include data associated with synonyms and out-of-date usage of a taxon name. These queries are currently not supported by TreeBASE. The developers of TreeBASE purposely excluded taxonomy [
<xref ref-type="bibr" rid="B8">8</xref>
] because there were too many difficulties for a small development team to overcome. The inclusion of a taxonomic infrastructure still poses several challenges. The distributed nature of taxon names and the many data sources in which these are held is a significant problem, as few sources cover the breadth of taxonomic coverage required by TreeBASE. Also, each taxonomic data source uses a particular classification scheme supporting specific taxonomic opinions. Not only do data sources differ in the content they deliver but, even those with similar content may follow different taxonomic opinions and therefore deliver very different classification schemes.</p>
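<p>The three expansions listed above can be combined in one routine. The following is a minimal sketch, assuming three small lookup tables; the table names, their contents and the function <monospace>expand_query</monospace> are hypothetical and the classification is deliberately flattened, so it should not be read as the TCl-Db schema or implementation.</p>
<preformat>
```python
# Toy lookup tables; every entry is illustrative, not TCl-Db content.
CHILDREN = {
    "Aves": ["Puffinus", "Diomedea", "Phoebastria", "Gallus"],
    "Puffinus": ["Puffinus gravis", "Puffinus tenuirostris"],
    "Diomedea": ["Diomedea albatrus"],
    "Phoebastria": ["Phoebastria albatrus"],
}
VERNACULAR = {
    "birds": ["Aves"],
    "short-tailed albatross": ["Phoebastria albatrus"],
}
SYNONYMS = [{"Phoebastria albatrus", "Diomedea albatrus"}]


def expand_query(term):
    """Expand a query term into the set of names to search for."""
    # Requirement 2: a vernacular term is first translated to scientific names.
    pending = list(VERNACULAR.get(term.lower(), [term]))
    expanded = set()
    while pending:
        name = pending.pop()
        if name in expanded:
            continue
        expanded.add(name)
        # Requirement 1: a higher taxon pulls in its subordinate names.
        pending.extend(CHILDREN.get(name, []))
        # Requirement 3: synonyms and outdated names are added alongside.
        for group in SYNONYMS:
            if name in group:
                pending.extend(group)
    return expanded


print(sorted(expand_query("short-tailed albatross")))
# ['Diomedea albatrus', 'Phoebastria albatrus']
print(sorted(expand_query("Puffinus")))
# ['Puffinus', 'Puffinus gravis', 'Puffinus tenuirostris']
print(len(expand_query("birds")))  # 9
```
</preformat>
<p>Each name in the expanded set would then be submitted as an ordinary exact-match taxon query, so the repository itself needs no change.</p>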
<p>These challenges may be addressed by combining the content of multiple taxonomic data sources and integrating the data into a form that will enable the taxon query extensions we postulate. TCl-Db, a Taxonomy and Classification Database, was developed to increase the accessibility and transparency of taxonomic data by integrating data from the available data sources. It was designed to provide a taxonomic infrastructure to TreeBASE and supports the queries systematists wish to perform.</p>
</sec>
</sec>
<sec>
<title>Construction and content</title>
<sec>
<title>TCl-Db, a Taxonomy and Classification Database</title>
<p>TCl-Db provides a merged view of taxonomic data through a single point of access. The database integrates taxonomic data from several distributed data sources. Architecturally, it forms a warehouse in which taxonomic names from the prominent taxonomic data sources ITIS [
<xref ref-type="bibr" rid="B22">22</xref>
], NCBI [
<xref ref-type="bibr" rid="B37">37</xref>
] and Sp2000 [
<xref ref-type="bibr" rid="B38">38</xref>
] are replicated and maintained in a common structure. These were selected as data sources because of their data content and the ease of downloading and replicating the data structure. Several Aves Checklists [
<xref ref-type="bibr" rid="B39">39</xref>
-
<xref ref-type="bibr" rid="B43">43</xref>
] were made available to us from the early bird project [
<xref ref-type="bibr" rid="B44">44</xref>
]; these were initially added in order to evaluate the potential of TCl-Db for data cleaning. Additional checklist data requested were Mammal Species of the World [
<xref ref-type="bibr" rid="B31">31</xref>
] and the taxonomic data from GRIN, Germplasm Resources Information Network [
<xref ref-type="bibr" rid="B45">45</xref>
]. A full list of contributing data sources is given in Table
<xref ref-type="table" rid="T1">1</xref>
.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Summary of data sources. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left">Data Source</td>
<td align="left">Download Date/Version</td>
<td align="right">Data Source Content</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">ITIS</td>
<td align="left">January 2004</td>
<td align="right">413,227</td>
</tr>
<tr>
<td align="left">ITIS</td>
<td align="left">October 2005</td>
<td align="right">400,863</td>
</tr>
<tr>
<td align="left">GRIN</td>
<td align="left">July 2005</td>
<td align="right">94,146</td>
</tr>
<tr>
<td align="left">NCBI</td>
<td align="left">September 2004</td>
<td align="right">273,404</td>
</tr>
<tr>
<td align="left">NCBI</td>
<td align="left">October 2005</td>
<td align="right">346,840</td>
</tr>
<tr>
<td align="left">SP2K</td>
<td align="left">2006 Annual Checklist</td>
<td align="right">1,262,469</td>
</tr>
<tr>
<td align="left">ALGAEBASE</td>
<td align="left">SP2K 2005 Annual Checklist</td>
<td align="right">38,150</td>
</tr>
<tr>
<td align="left">MSOW</td>
<td align="left">July 2005</td>
<td align="right">6,058</td>
</tr>
<tr>
<td colspan="3">
<hr></hr>
</td>
</tr>
<tr>
<td align="center" colspan="3">Aves Checklists from early bird project</td>
</tr>
<tr>
<td align="left">nam980612</td>
<td align="left">1998 [44]</td>
<td align="right">12,034</td>
</tr>
<tr>
<td align="left">American Ornithologists' Union</td>
<td align="left">1983 [39]</td>
<td align="right">4,936</td>
</tr>
<tr>
<td align="left">American Ornithologists' Union</td>
<td align="left">1998 [40]</td>
<td align="right">2,755</td>
</tr>
<tr>
<td align="left">Sibley and Monroe</td>
<td align="left">1997 [33]</td>
<td align="right">11,932</td>
</tr>
<tr>
<td align="left">Peters</td>
<td align="left">1987 [42]</td>
<td align="right">11,267</td>
</tr>
<tr>
<td align="left">Clements</td>
<td align="left">2000 [43]</td>
<td align="right">19,305</td>
</tr>
<tr>
<td align="left">Bird_names</td>
<td align="left">IOC World bird names 2006</td>
<td align="right">19,313</td>
</tr>
<tr>
<td align="left">Morony, Bock, and Farrand</td>
<td align="left">1975 [41]</td>
<td align="right">11,455</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Updates have been performed for ITIS and NCBI.</p>
</table-wrap-foot>
</table-wrap>
<p>TCl-Db was designed at the early stages of this project, between 2003 and 2004. A full description of the database design and implementation, and an Entity Relationship Diagram (ERD) can be found at
<ext-link ext-link-type="uri" xlink:href="http://spira.zoology.gla.ac.uk/doc.php"></ext-link>
. The design phase of TCl-Db identified the entities that support the requirements presented at the start of this section. The entities are as follows. A N
<sc>AME</sc>
represents a taxon name. S
<sc>YNONYM</sc>
N
<sc>AME</sc>
is a taxon name that, although once used as a valid name, was replaced with a new valid name. V
<sc>ERNACULAR</sc>
N
<sc>AME</sc>
represents a name used in common language to represent an organism. N
<sc>AME</sc>
S
<sc>OURCE</sc>
represents the data source from which each N
<sc>AME</sc>
entity originated. T
<sc>REE</sc>
represents a classification that can be built based on data from a N
<sc>AME</sc>
S
<sc>OURCE</sc>
and N
<sc>ODES</sc>
represent the structure of the T
<sc>REE</sc>
. The physical database design, implemented using the Oracle database management system [
<xref ref-type="bibr" rid="B46">46</xref>
], is shown in Figure
<xref ref-type="fig" rid="F1">1</xref>
. The many to many relationship between N
<sc>AME</sc>
and N
<sc>AME</sc>
S
<sc>OURCE</sc>
is resolved with an association entity, A
<sc>SSERTION</sc>
. As well as ensuring the taxonomic names in TCl-Db are tightly bound to their data sources, the A
<sc>SSERTION</sc>
entity also increases transparency, by making conflicts and differences between data sources more obvious. This is useful when comparing the composition and data quality of data sources.</p>
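<p>The role of the association entity can be illustrated with a minimal sketch. It is not the TCl-Db schema: the table layout, the column names other than dbsource_id, the form of the check constraint and the NCBI identifiers are assumptions made for illustration, and SQLite stands in for Oracle so that the example is self-contained.</p>

```python
import sqlite3

# Hypothetical, simplified schema: NAME and NAME_SOURCE are linked
# many-to-many through ASSERTION, which also keeps the identifier that
# the originating data source uses for the name (dbsource_id).
SCHEMA = """
CREATE TABLE name (name_id INTEGER PRIMARY KEY, name_text TEXT UNIQUE NOT NULL);
CREATE TABLE name_source (source_id INTEGER PRIMARY KEY, source_name TEXT NOT NULL);
CREATE TABLE assertion (
    name_id     INTEGER NOT NULL REFERENCES name(name_id),
    source_id   INTEGER NOT NULL REFERENCES name_source(source_id),
    dbsource_id TEXT NOT NULL CHECK (dbsource_id != ''),
    PRIMARY KEY (name_id, source_id)
);
"""


def build_demo():
    """Create an in-memory database holding two names and two sources."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO name VALUES (?, ?)",
                     [(1, "Aves"), (2, "Metazoa")])
    conn.executemany("INSERT INTO name_source VALUES (?, ?)",
                     [(1, "ITIS"), (2, "NCBI")])
    # 174371 is the ITIS identifier for Aves quoted in the text;
    # the NCBI identifiers are placeholders, not real values.
    conn.executemany("INSERT INTO assertion VALUES (?, ?, ?)",
                     [(1, 1, "174371"),
                      (1, 2, "ncbi-placeholder-1"),
                      (2, 2, "ncbi-placeholder-2")])
    return conn


def sources_for(conn, name_text):
    """List the data sources that assert a given name."""
    rows = conn.execute(
        """SELECT s.source_name
             FROM assertion a
             JOIN name n ON n.name_id = a.name_id
             JOIN name_source s ON s.source_id = a.source_id
            WHERE n.name_text = ?
            ORDER BY s.source_name""", (name_text,))
    return [row[0] for row in rows]
```

<p>In this sketch, asking which sources assert a name makes differences between sources visible directly: a name present in only one source returns a single row.</p>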
<fig position="float" id="F1">
<label>Figure 1</label>
<caption>
<p>
<bold>TCl-Db Database Tables</bold>
. TCl-Db tables represent the database implementation. PK means primary key, FK means foreign key, U stands for a uniqueness constraint, and I indicates an integrity constraint (in the table
<sc>ASSERTION</sc>
there is a check constraint on the column dbsource_id). In database terminology tables are called relations and columns are called attributes, while the other concepts express integrity constraints which guarantee data quality. Here, we use the terms tables and columns when we refer to the physical model which additionally includes a number of materialised views and database functions and procedures. Those are used during database updates, to keep track of unique identifiers and to maintain referential integrity.</p>
</caption>
<graphic xlink:href="1471-2148-9-93-1"></graphic>
</fig>
<p>The design ensures that each taxon name entering the warehouse is tightly linked to its data source and data source classification. This supplements the concept of data provenance [
<xref ref-type="bibr" rid="B47">47</xref>
,
<xref ref-type="bibr" rid="B48">48</xref>
] and is achieved through the attribute dbsource_id. The dbsource_ids are the database identifiers used at the database source, for example the ITIS dbsource_id for Aves is 174371. These identifiers were stored so that they could be used to link back to the original data source.</p>
</sec>
<sec>
<title>Hierarchical Query Support</title>
<p>To support hierarchical queries on TreeBASE, TCl-Db stores
<italic>multiple classifications </italic>
giving users the option to choose which hierarchy to traverse in a query. An example of a hierarchical query is a search on the family name
<italic>Crocodylidae</italic>
. In a hierarchical query this search term would include all the subordinate terms within this family name, i.e., the genera and species names.</p>
<p>TCl-Db supports three forms of hierarchical queries: Nested sets [
<xref ref-type="bibr" rid="B49">49</xref>
], Materialized Paths [
<xref ref-type="bibr" rid="B50">50</xref>
] and Oracle's 'Connect By' [
<xref ref-type="bibr" rid="B51">51</xref>
]. The calculation of the Nested set and Materialized Path data is depicted in Figure
<xref ref-type="fig" rid="F2">2</xref>
. Figure
<xref ref-type="fig" rid="F2">2a</xref>
is an example hierarchy with the nodes numerically labelled. The same tree is depicted in Figure
<xref ref-type="fig" rid="F2">2b</xref>
, with the nodes labelled with their materialised path and Figure
<xref ref-type="fig" rid="F2">2c</xref>
is the summativity representation. Nested sets (Figure
<xref ref-type="fig" rid="F2">2c</xref>
) represent a tree using two numbers, a left_id and a right_id, stored in the columns left_id and right_id of table N
<sc>ODE</sc>
(see Figure
<xref ref-type="fig" rid="F1">1</xref>
). These left_id and right_ids (nested sets) are calculated using the summativity representation given in Figure
<xref ref-type="fig" rid="F2">2c</xref>
. For example, Nodes 10 and 11 are contained within Node 4, which is contained within Node 1. The nested sets reflect this containment: Node 1 has the largest (most inclusive) set (1, 22), and Node 4 has the set (16, 21), which includes its children, Nodes 10 (17, 18) and 11 (19, 20). The hierarchical query to select all children of Node 4 is a simple numerical comparison; see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
(Query 1) for an example SQL query using the nested set left_id and right_id. This query uses the function
<sc>GET</sc>
_
<sc>NAME</sc>
_
<sc>TEXT</sc>
which for a given taxon, returns the children of that taxon within the specified hierarchy.</p>
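<p>The numbering described above can be reproduced with a short depth-first walk. The following is a minimal sketch and not the TCl-Db implementation: the children of Node 1 and of Node 4 follow the text, whereas the shape of the subtree under Node 2 is an assumption, chosen only so that the tree has eleven nodes.</p>

```python
# Toy tree. Nodes 2 and 4 below the root, and Nodes 10 and 11 below Node 4,
# follow the text; the subtree under Node 2 is an assumed shape.
CHILDREN = {
    1: [2, 4],
    2: [3, 5, 6],
    3: [7, 8, 9],
    4: [10, 11],
}


def nested_sets(root, children):
    """Return {node: (left_id, right_id)} computed by a depth-first walk."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1              # number assigned on entering the node
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1              # number assigned on leaving the node
        bounds[node] = (left, counter)

    visit(root)
    return bounds


def descendants(node, bounds):
    """Nodes whose interval lies strictly inside the interval of `node`."""
    left, right = bounds[node]
    return sorted(n for n, (lo, hi) in bounds.items()
                  if lo > left and right > hi)
```

<p>With this toy tree the walk assigns (1, 22) to Node 1 and (16, 21) to Node 4, and selecting the descendants of Node 4 reduces to comparing two pairs of integers.</p>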
<fig position="float" id="F2">
<label>Figure 2</label>
<caption>
<p>
<bold>Nested set and Path representation of a tree</bold>
. The directed acyclic graph given in (a) is represented as Materialised paths in (b). The nested sets are shown in (c) using a summativity representation instead of the traditional Tree representation. This representation gives a clearer view of the containment property of hierarchies.</p>
</caption>
<graphic xlink:href="1471-2148-9-93-2"></graphic>
</fig>
<p>The materialized paths are calculated through a tree walk where a count is incremented as a node is encountered within each level and a new count is created when moving down a level. For example, the root of the tree, the uppermost level, has the path
<bold>1</bold>
/ and the level below inherits this root path and an additional count reflecting their position below the root. For the two nodes below the root, the path
<bold>1/1 </bold>
is given to Node 2 and
<bold>1/2 </bold>
to Node 4. Nodes 10 and 11 are a level below Node 4 and gain their parent path
<bold>1/2 </bold>
and a new count indicating their location within their parent path thus giving them the paths
<bold>1/2/1 </bold>
and
<bold>1/2/2</bold>
, and so on. Materialised paths are stored in the N
<sc>ODE</sc>
table in the column path as shown in Figure
<xref ref-type="fig" rid="F1">1</xref>
, (see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Query 2, for an example SQL query using materialised paths). The SQL query uses the property that each node inherits its parent's path; therefore, all descendants of a node can be selected because the node's path is a prefix of each of their paths. This query uses an additional function
<sc>GET</sc>
_
<sc>ID</sc>
which returns name_id for a given name, simplifying the query so that it does not require any table joins.</p>
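<p>The path calculation can be sketched as follows. This is an illustration under stated assumptions rather than the TCl-Db code: the toy tree is the same as in the text only for Nodes 1, 2, 4, 10 and 11, and the subtree under Node 2 is an assumed shape.</p>

```python
# Toy tree: the subtree under Node 2 is an assumption; the rest follows the text.
CHILDREN = {1: [2, 4], 2: [3, 5, 6], 3: [7, 8, 9], 4: [10, 11]}


def materialised_paths(root, children):
    """Give each node its parent's path plus its position among its siblings."""
    paths = {root: "1"}
    stack = [root]
    while stack:
        node = stack.pop()
        for position, child in enumerate(children.get(node, []), start=1):
            paths[child] = f"{paths[node]}/{position}"
            stack.append(child)
    return paths


def descendants(node, paths):
    """Nodes whose path starts with the path of `node` followed by a slash."""
    prefix = paths[node] + "/"
    return sorted(n for n, p in paths.items() if p.startswith(prefix))
```

<p>The trailing slash in the prefix matters: without it, a prefix such as 1/2 would also match a sibling path such as 1/20. In SQL the same test is typically written as a LIKE predicate on the path column.</p>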
<p>Finally, columns name_id and parent_name_id in the table N
<sc>ODE</sc>
are used by the 'Connect By' clause (see Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
, Query 3). This method uses the hierarchical relationship modelled as a self-referencing relation. This is the simplest method of modelling the hierarchical relationship between nodes; however, the 'Connect By' clause is specific to Oracle. The addition of the nested sets and materialized paths makes the database portable to other database management systems such as MySQL or PostgreSQL.</p>
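<p>For comparison, the same self-referencing relation can be traversed with a recursive common table expression, a standard SQL construct available in SQLite and in current versions of PostgreSQL and MySQL. This is a swapped-in technique shown as a sketch, not Oracle's 'Connect By' and not a query from the Additional file; only the column names name_id and parent_name_id come from the text, and the toy data are assumed.</p>

```python
import sqlite3


def build_demo():
    """Create a NODE table holding a small parent/child hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (name_id INTEGER PRIMARY KEY, parent_name_id INTEGER)")
    # (name_id, parent_name_id); the root has no parent.
    edges = [(1, None), (2, 1), (4, 1), (3, 2), (5, 2), (6, 2),
             (7, 3), (8, 3), (9, 3), (10, 4), (11, 4)]
    conn.executemany("INSERT INTO node VALUES (?, ?)", edges)
    return conn


def subtree_ids(conn, root_id):
    """Return the name_ids of all descendants of root_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(name_id) AS (
            SELECT name_id FROM node WHERE parent_name_id = ?
            UNION ALL
            SELECT n.name_id
              FROM node AS n
              JOIN subtree AS s ON n.parent_name_id = s.name_id
        )
        SELECT name_id FROM subtree ORDER BY name_id
        """, (root_id,))
    return [row[0] for row in rows]
```

<p>Unlike nested sets and materialised paths, this approach needs no precomputed columns, at the cost of repeated joins at query time.</p>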
</sec>
<sec>
<title>Vernacular Queries and Query Expansion Techniques</title>
<p>Within TCl-Db synonym names are linked to valid names via the table S
<sc>YNONYM</sc>
_N
<sc>AME</sc>
and vernaculars are linked to valid names through the table V
<sc>ERNACULARS</sc>
. This supports query expansion of synonyms and vernaculars to Latin names. An example query for the term 'crocodiles' is shown in Additional file
<xref ref-type="supplementary-material" rid="S1">1</xref>
as Query 4.</p>
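<p>Query expansion of this kind can be sketched as a union over the three name tables. The schema below is a simplified assumption and does not reproduce the TCl-Db tables or Query 4; the vernacular pairs are taken from Table 5, and the synonym row is a placeholder rather than a real taxonomic statement.</p>

```python
import sqlite3


def build_demo():
    """Create hypothetical name, vernacular and synonym tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE name (name_id INTEGER PRIMARY KEY, name_text TEXT NOT NULL);
        CREATE TABLE vernaculars (vernacular TEXT NOT NULL, name_id INTEGER NOT NULL);
        CREATE TABLE synonym_name (synonym TEXT NOT NULL, name_id INTEGER NOT NULL);
    """)
    conn.executemany("INSERT INTO name VALUES (?, ?)",
                     [(1, "Acer"), (2, "Mus musculus"),
                      (3, "Saccharomyces cerevisiae")])
    # Vernacular to Latin pairs as listed in Table 5.
    conn.executemany("INSERT INTO vernaculars VALUES (?, ?)",
                     [("maple", 1), ("mouse", 2), ("yeast", 3)])
    # Placeholder synonym used only to exercise the code path.
    conn.execute("INSERT INTO synonym_name VALUES ('Exampleus oldname', 2)")
    return conn


def expand(conn, term):
    """Map a search term to valid names via exact, vernacular or synonym match."""
    rows = conn.execute(
        """
        SELECT n.name_text FROM name n
         WHERE lower(n.name_text) = lower(:t)
        UNION
        SELECT n.name_text FROM vernaculars v
          JOIN name n ON n.name_id = v.name_id
         WHERE lower(v.vernacular) = lower(:t)
        UNION
        SELECT n.name_text FROM synonym_name s
          JOIN name n ON n.name_id = s.name_id
         WHERE lower(s.synonym) = lower(:t)
        ORDER BY 1
        """, {"t": term})
    return [row[0] for row in rows]
```

<p>The valid names returned by such an expansion can then be used as search terms against TreeBASE in place of the original vernacular or synonym.</p>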
</sec>
</sec>
<sec>
<title>Utility</title>
<p>TCl-Db was used to test the following hypothesis:
<italic>Data retrieval using taxonomic search terms in TreeBASE can be significantly improved by using a data warehouse of integrated taxonomic names and their classifications</italic>
.</p>
<sec>
<title>Data Sets</title>
<p>The data sets used in this study are summarised in Table
<xref ref-type="table" rid="T2">2</xref>
. The upper section of Table
<xref ref-type="table" rid="T2">2</xref>
refers to data in the databases TCl-Db and TreeBASE. The lower section of Table
<xref ref-type="table" rid="T2">2</xref>
refers to data from the TreeBASE query log and the AOL query log. The table shows that 29,035 TreeBASE taxa (within the local version of the TreeBASE database) were mapped to TCl-Db taxa, and that 27,239 taxon queries from the TreeBASE query log mapped to data within TCl-Db.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Summary of data sets used. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left" colspan="7">Summary of Data Sets</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Database</td>
<td align="right">Taxa</td>
<td align="right">Mapped to TCl-Db</td>
<td align="right">Valid names</td>
<td align="right">Vernaculars</td>
<td align="right">Synonyms</td>
<td align="right">Query Date</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td align="left">TCl-Db</td>
<td align="right">1,434,846</td>
<td align="right">1,434,846</td>
<td align="right">916,402</td>
<td align="right">213,602</td>
<td align="right">304,842</td>
<td align="right">01/2006</td>
</tr>
<tr>
<td align="left">TreeBASE</td>
<td align="right">56,712</td>
<td align="right">29,035</td>
<td align="right">27,638</td>
<td align="right">540</td>
<td align="right">856</td>
<td align="right">04/2006</td>
</tr>
<tr>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td align="left">Query Log</td>
<td align="right">Queries</td>
<td align="right">Mapped to TCl-Db</td>
<td align="right">Valid names</td>
<td align="right">Vernaculars</td>
<td align="right">Synonyms</td>
<td align="right">Download Date</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td align="left">TreeBASE</td>
<td align="right">62,126</td>
<td align="right">27,239</td>
<td align="right">17,006</td>
<td align="right">4,624</td>
<td align="right">1,010</td>
<td align="right">05/2006</td>
</tr>
<tr>
<td align="left">AOL</td>
<td align="right">9,941,434</td>
<td align="right">8,281</td>
<td align="right">3,076</td>
<td align="right">3,590</td>
<td align="right">307</td>
<td align="right">10/2006</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The upper section summarises the taxonomic data content for the TCl-Db data warehouse and the taxonomic data content of our local copy of TreeBASE. The lower section summarises the taxonomic data content for the TreeBASE query log and the AOL query log.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Data Retrieval from TreeBASE</title>
<sec>
<title>TreeBASE Taxon Search Log</title>
<p>The TreeBASE web interface, available at the URL
<ext-link ext-link-type="uri" xlink:href="http://www.treebase.org"></ext-link>
, allows users to conduct taxon queries and queries by a specific matrix, study or tree identifier. These queries return the phylogenetic studies that contain the term that was used in the search. In this study the database structure of TreeBASE was replicated locally so that SQL queries could use the tables within both TreeBASE and TCl-Db.</p>
<p>The taxon queries on TreeBASE came from a script given to us by the TreeBASE developers. The script returned all queries performed using the taxon field in the TreeBASE user interface. These queries, and the number of times each had been performed, were loaded into a database table and given unique identifiers. The data were first trimmed to remove trailing spaces. Duplicates were then removed, as were non-taxon searches such as queries based on TreeBASE identifiers. Several searches for study authors were removed by comparing the queries to the author names stored in TreeBASE. GenBank accession number queries were also removed from the data set. The remaining 62,126 queries were then mapped to TCl-Db, giving 27,239 distinct taxon queries. Using these 27,239 queries, we compare the data returned in response to queries posed directly against a local copy of TreeBASE, downloaded in 2006, with the data returned through the wrapper software, which uses both TreeBASE and TCl-Db. The number of queries that do not return any TreeBASE data is significantly higher than the number of queries that do (16,018 against 11,221). Approximately 50% of the queries posed on TreeBASE were higher taxa queries (of rank genus and above), while 28% were species queries. Of the valid name queries posed against TreeBASE, 71% do not return data, and 94% of the vernacular and 85% of the synonym queries also return no data. This analysis of the query logs shows that users have been experiencing very poor data retrieval.</p>
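<p>The cleaning steps described above can be sketched as a small filter. This is not the script used in the study: the two regular expressions are simplified assumptions about what TreeBASE identifiers and GenBank accession numbers look like, and merging duplicates case-insensitively is also an assumption.</p>

```python
import re
from collections import Counter

# Hypothetical patterns; the formats actually filtered in the study are
# not given in the text.
TREEBASE_ID = re.compile(r"^[STM]\d+$", re.IGNORECASE)
GENBANK_ACCESSION = re.compile(r"^[A-Z]{1,2}\d{5,6}$")


def clean_log(raw_queries, author_names):
    """Trim, filter and de-duplicate raw taxon-field queries.

    Returns a Counter mapping each remaining query (lower-cased) to the
    number of times it occurred in the raw log.
    """
    authors = {name.strip().lower() for name in author_names}
    counts = Counter()
    for raw in raw_queries:
        query = raw.strip()                      # remove surrounding spaces
        if not query:
            continue
        if TREEBASE_ID.match(query) or GENBANK_ACCESSION.match(query):
            continue                             # identifier, not a taxon
        if query.lower() in authors:
            continue                             # search for a study author
        counts[query.lower()] += 1               # duplicates collapse here
    return counts
```

<p>Keeping the counts alongside the distinct queries preserves the information on how often each query was performed, which the text notes was loaded together with the queries.</p>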
</sec>
<sec>
<title>TCl-Db hierarchical query expansion improves data retrieval</title>
<p>Tables
<xref ref-type="table" rid="T3">3</xref>
and
<xref ref-type="table" rid="T4">4</xref>
compare the query effectiveness of TreeBASE alone and TreeBASE terms expanded with taxonomy data from TCl-Db with regard to genus queries in Table
<xref ref-type="table" rid="T3">3</xref>
and higher taxa in Table
<xref ref-type="table" rid="T4">4</xref>
. Overall, for 6,622 genus queries that return
<italic>no data </italic>
in TreeBASE, hierarchical query expansion via TCl-Db produces 1,127 trees. The most significant improvement in the number of trees found is seen for 'pinus' (Table
<xref ref-type="table" rid="T3">3</xref>
, from 7 in TreeBASE alone to 123 trees after TCl-Db query expansion) and 'Metazoa' (Table
<xref ref-type="table" rid="T4">4</xref>
, from 5 without TCl-Db to 1,014 additional trees when using the NCBI taxonomy within TCl-Db).</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption>
<p>Genus Queries.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left">Genus</td>
<td align="right">
<sc>SPECIES</sc>
count in ITIS</td>
<td align="right">
<sc>SPECIES</sc>
count in NCBI</td>
<td align="right">
<sc>SPECIES</sc>
count in Sp2000</td>
<td align="right">T
<sc>REES</sc>
Returned from genus search on TreeBASE</td>
<td align="right">T
<sc>REES</sc>
Returned from species search using TCl-Db</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<italic>Platanus</italic>
</td>
<td align="right">6</td>
<td align="right">5</td>
<td align="right">6</td>
<td align="right">23</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">
<bold>
<italic>Drosophila</italic>
</bold>
</td>
<td align="right">
<bold>378</bold>
</td>
<td align="right">
<bold>43</bold>
</td>
<td align="right">
<bold>2,066</bold>
</td>
<td align="right">
<bold>28</bold>
</td>
<td align="right">
<bold>88</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>Saccharomyces</italic>
</td>
<td align="right">13</td>
<td align="right">62</td>
<td align="right">6</td>
<td align="right">26</td>
<td align="right">73</td>
</tr>
<tr>
<td align="left">
<italic>Homo</italic>
</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">52</td>
</tr>
<tr>
<td align="left">
<italic>Quercus</italic>
</td>
<td align="right">214</td>
<td align="right">89</td>
<td align="right">211</td>
<td align="right">1</td>
<td align="right">5</td>
</tr>
<tr>
<td align="left">
<bold>
<italic>Pinus</italic>
</bold>
</td>
<td align="right">
<bold>62</bold>
</td>
<td align="right">
<bold>66</bold>
</td>
<td align="right">
<bold>57</bold>
</td>
<td align="right">
<bold>7</bold>
</td>
<td align="right">
<bold>123</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>Arabidopsis</italic>
</td>
<td align="right">2</td>
<td align="right">10</td>
<td align="right">2</td>
<td align="right">9</td>
<td align="right">37</td>
</tr>
<tr>
<td align="left">
<italic>Acer</italic>
</td>
<td align="right">21</td>
<td align="right">79</td>
<td align="right">21</td>
<td align="right">7</td>
<td align="right">9</td>
</tr>
<tr>
<td align="left">
<italic>Canis</italic>
</td>
<td align="right">7</td>
<td align="right">10</td>
<td align="right">7</td>
<td align="right">9</td>
<td align="right">29</td>
</tr>
<tr>
<td align="left">
<italic>Pan</italic>
</td>
<td align="right">2</td>
<td align="right">2</td>
<td align="right">2</td>
<td align="right">1</td>
<td align="right">4</td>
</tr>
<tr>
<td align="left">
<italic>Escherichia</italic>
</td>
<td align="right">21</td>
<td align="right">1</td>
<td align="right">7</td>
<td align="right">0</td>
<td align="right">8</td>
</tr>
<tr>
<td align="left">
<italic>Acacia</italic>
</td>
<td align="right">62</td>
<td align="right">160</td>
<td align="right">1,315</td>
<td align="right">0</td>
<td align="right">4</td>
</tr>
<tr>
<td align="left">
<italic>Acorus</italic>
</td>
<td align="right">2</td>
<td align="right">4</td>
<td align="right">2</td>
<td align="right">13</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">
<italic>Phytophthora</italic>
</td>
<td align="right">1</td>
<td align="right">74</td>
<td align="right">58</td>
<td align="right">13</td>
<td align="right">29</td>
</tr>
<tr>
<td align="left">
<italic>Mus</italic>
</td>
<td align="right">38</td>
<td align="right">25</td>
<td align="right">38</td>
<td align="right">28</td>
<td align="right">30</td>
</tr>
<tr>
<td align="left">
<italic>Bacillus</italic>
</td>
<td align="right">1</td>
<td align="right">1,450</td>
<td align="right">150</td>
<td align="right">1</td>
<td align="right">5</td>
</tr>
<tr>
<td align="left">
<italic>Magnolia</italic>
</td>
<td align="right">12</td>
<td align="right">76</td>
<td align="right">134</td>
<td align="right">8</td>
<td align="right">4</td>
</tr>
<tr>
<td align="left">
<italic>Aspergillus</italic>
</td>
<td align="right">0</td>
<td align="right">155</td>
<td align="right">185</td>
<td align="right">5</td>
<td align="right">43</td>
</tr>
<tr>
<td align="left">
<italic>Fusarium</italic>
</td>
<td align="right">0</td>
<td align="right">183</td>
<td align="right">85</td>
<td align="right">2</td>
<td align="right">19</td>
</tr>
<tr>
<td align="left">
<italic>Tetragnatha</italic>
</td>
<td align="right">0</td>
<td align="right">21</td>
<td align="right">323</td>
<td align="right">6</td>
<td align="right">6</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p> The number of species within each genus for ITIS, NCBI and Sp2000. Each source shows varied species content for each genus, most notably for
<italic>Pinus </italic>
and
<italic>Drosophila</italic>
. The last two columns are: the number of trees returned for the genus queries performed directly on TreeBASE; and the number of additional trees returned using species names found using all three classifications in a hierarchical query in TCl-Db.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption>
<p>Higher Taxa Queries.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left">QUERY</td>
<td align="right">Trees Returned using TreeBASE</td>
<td align="right">Trees Returned using TCl-Db with Sp2000 Hierarchy</td>
<td align="right">Trees Returned using TCl-Db with ITIS Hierarchy</td>
<td align="right">Trees Returned using TCl-Db with NCBI Hierarchy</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Diptera</td>
<td align="right">7</td>
<td align="right">X</td>
<td align="right">111</td>
<td align="right">106</td>
</tr>
<tr>
<td align="left">Lepidoptera</td>
<td align="right">5</td>
<td align="right">41</td>
<td align="right">39</td>
<td align="right">71</td>
</tr>
<tr>
<td align="left">Carnivora</td>
<td align="right">12</td>
<td align="right">49</td>
<td align="right">49</td>
<td align="right">65</td>
</tr>
<tr>
<td align="left">Animalia</td>
<td align="right">1</td>
<td align="right">954</td>
<td align="right">856</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">Solanaceae</td>
<td align="right">9</td>
<td align="right">80</td>
<td align="right">80</td>
<td align="right">80</td>
</tr>
<tr>
<td align="left">Rosaceae</td>
<td align="right">1</td>
<td align="right">42</td>
<td align="right">42</td>
<td align="right">38</td>
</tr>
<tr>
<td align="left">Felidae</td>
<td align="right">7</td>
<td align="right">10</td>
<td align="right">10</td>
<td align="right">15</td>
</tr>
<tr>
<td align="left">Vertebrata</td>
<td align="right">3</td>
<td align="right">0</td>
<td align="right">408</td>
<td align="right">443</td>
</tr>
<tr>
<td align="left">
<bold>Fungi</bold>
</td>
<td align="right">
<bold>8</bold>
</td>
<td align="right">
<bold>807</bold>
</td>
<td align="right">
<bold>389</bold>
</td>
<td align="right">
<bold>814</bold>
</td>
</tr>
<tr>
<td align="left">Crustacea</td>
<td align="right">2</td>
<td align="right">0</td>
<td align="right">47</td>
<td align="right">38</td>
</tr>
<tr>
<td align="left">Chordata</td>
<td align="right">1</td>
<td align="right">433</td>
<td align="right">411</td>
<td align="right">446</td>
</tr>
<tr>
<td align="left">
<bold>Metazoa</bold>
</td>
<td align="right">
<bold>5</bold>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>1,014</bold>
</td>
</tr>
<tr>
<td align="left">Poaceae</td>
<td align="right">11</td>
<td align="right">100</td>
<td align="right">100</td>
<td align="right">95</td>
</tr>
<tr>
<td align="left">Rodentia</td>
<td align="right">9</td>
<td align="right">100</td>
<td align="right">100</td>
<td align="right">102</td>
</tr>
<tr>
<td align="left">Chlorophyceae</td>
<td align="right">6</td>
<td align="right">50</td>
<td align="right">66</td>
<td align="right">50</td>
</tr>
<tr>
<td align="left">Cnidaria</td>
<td align="right">3</td>
<td align="right">75</td>
<td align="right">78</td>
<td align="right">79</td>
</tr>
<tr>
<td align="left">Arthropoda</td>
<td align="right">5</td>
<td align="right">404</td>
<td align="right">284</td>
<td align="right">371</td>
</tr>
<tr>
<td align="left">Primates</td>
<td align="right">7</td>
<td align="right">61</td>
<td align="right">61</td>
<td align="right">61</td>
</tr>
<tr>
<td align="left">Aves</td>
<td align="right">8</td>
<td align="right">91</td>
<td align="right">91</td>
<td align="right">87</td>
</tr>
<tr>
<td align="left">Reptilia</td>
<td align="right">1</td>
<td align="right">74</td>
<td align="right">74</td>
<td align="right">0</td>
</tr>
<tr>
<td align="left">Coleoptera</td>
<td align="right">3</td>
<td align="right">67</td>
<td align="right">45</td>
<td align="right">49</td>
</tr>
<tr>
<td align="left">Cetacea</td>
<td align="right">16</td>
<td align="right">47</td>
<td align="right">17</td>
<td align="right">47</td>
</tr>
<tr>
<td align="left">Bacteria</td>
<td align="right">2</td>
<td align="right">55</td>
<td align="right">13</td>
<td align="right">35</td>
</tr>
<tr>
<td align="left">Ascomycota</td>
<td align="right">9</td>
<td align="right">549</td>
<td align="right">273</td>
<td align="right">540</td>
</tr>
<tr>
<td align="left">
<bold>Archaea</bold>
</td>
<td align="right">
<bold>4</bold>
</td>
<td align="right">
<bold>X</bold>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>15</bold>
</td>
</tr>
<tr>
<td align="left">Mollusca</td>
<td align="right">14</td>
<td align="right">75</td>
<td align="right">86</td>
<td align="right">93</td>
</tr>
<tr>
<td align="left">Mammalia</td>
<td align="right">12</td>
<td align="right">224</td>
<td align="right">212</td>
<td align="right">221</td>
</tr>
<tr>
<td align="left">Fabaceae</td>
<td align="right">11</td>
<td align="right">151</td>
<td align="right">143</td>
<td align="right">151</td>
</tr>
<tr>
<td align="left">Asteraceae</td>
<td align="right">11</td>
<td align="right">127</td>
<td align="right">127</td>
<td align="right">156</td>
</tr>
<tr>
<td align="left">Insecta</td>
<td align="right">2</td>
<td align="right">325</td>
<td align="right">238</td>
<td align="right">301</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Expanding query terms hierarchically increases the number of trees returned from TreeBASE. The first column shows the count of trees found in TreeBASE. The remaining columns show the number of trees returned using hierarchical query expansion on TreeBASE using the Sp2000, ITIS and NCBI classifications. The table highlights the importance of including more than one hierarchy. For instance, the query 'Metazoa' returns no data when using ITIS or Sp2000, and 1,014 trees when using NCBI. Also, for 'Fungi' we see that NCBI and ITIS differ. In some cases the hierarchical query failed, denoted with an X. For example, as the term 'Archaea' is both a genus and a superkingdom in SP2K, the hierarchical query fails.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>TCl-Db synonym and vernacular query expansion has a positive impact on data retrieval quality</title>
<p>Vernacular queries on TreeBASE perform particularly poorly (Table
<xref ref-type="table" rid="T5">5</xref>
), as the most commonly submitted queries return no results, with the exception of the query 'primates', which returns two trees. While vernaculars are not the most frequently used search terms, TCl-Db allows these terms to be expanded to Latin names. For example, 'acacia' (Latin
<italic>Robinia pseudoacacia</italic>
) returns no data in TreeBASE, while the Latin term, related to acacia, returns 2 trees, and 'yeast' (Latin
<italic>Saccharomyces cerevisiae</italic>
) has no direct hits in TreeBASE, but returns 70 trees when TCl-Db is used (a similar observation was made by Jensen et al. in [
<xref ref-type="bibr" rid="B6">6</xref>
]). In TCl-Db, the inclusion of the alternative Latin names significantly improves the quality of data retrieval. For those queries that translate to higher taxa names, data retrieval can be further enhanced by performing a hierarchical query.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption>
<p>Vernacular Query terms. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left">Query</td>
<td align="left">TCl-Db Query</td>
<td align="right">TreeBASE alone</td>
<td align="right">TCl-Db with TreeBASE</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">maple</td>
<td align="left">Acer</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr>
<td align="left">primates</td>
<td align="left">primata</td>
<td align="right">2</td>
<td align="right">3</td>
</tr>
<tr>
<td align="left">pine</td>
<td align="left">
<italic>Pinus brutia</italic>
</td>
<td align="right">0</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">pine</td>
<td align="left">Pinus</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr>
<td align="left">eubacteria</td>
<td align="left">Bacteria</td>
<td align="right">0</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">mouse</td>
<td align="left">
<italic>Mus musculus</italic>
</td>
<td align="right">0</td>
<td align="right">28</td>
</tr>
<tr>
<td align="left">birds</td>
<td align="left">Aves</td>
<td align="right">0</td>
<td align="right">8</td>
</tr>
<tr>
<td align="left">dog</td>
<td align="left">
<italic>Canis familiaris</italic>
</td>
<td align="right">0</td>
<td align="right">19</td>
</tr>
<tr>
<td align="left">mammals</td>
<td align="left">Mammalia</td>
<td align="right">0</td>
<td align="right">12</td>
</tr>
<tr>
<td align="left">human</td>
<td align="left">
<italic>Homo sapiens</italic>
</td>
<td align="right">0</td>
<td align="right">52</td>
</tr>
<tr>
<td align="left">elm</td>
<td align="left">Ulmus</td>
<td align="right">0</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">
<bold>Acacia</bold>
</td>
<td align="left">
<italic>Parkinsonia aculeata</italic>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>2</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Acacia</bold>
</td>
<td align="left">
<italic>Acacia ampliceps</italic>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>2</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Acacia</bold>
</td>
<td align="left">
<italic>Robinia pseudoacacia</italic>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>10</bold>
</td>
</tr>
<tr>
<td align="left">yeast</td>
<td align="left">
<italic>Saccharomyces cerevisiae</italic>
</td>
<td align="right">0</td>
<td align="right">70</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The most common vernacular queries with Latin names and the number of trees found for each query.</p>
</table-wrap-foot>
</table-wrap>
<p>Expanding search terms with synonyms also improves data retrieval. There were 868 synonym queries that returned no data using TreeBASE. In response to these queries, TCl-Db returned 594 trees by expanding the search term with valid names linked to synonyms.</p>
<p>An alternative query log from AOL [
<xref ref-type="bibr" rid="B52">52</xref>
] was analysed for taxon searches. Taxon searches were extracted from this log to provide a set of queries for testing our TCl-Db TreeBASE wrapper. Surprisingly, the AOL data show that vernacular queries were only marginally more frequent than scientific name queries (see Table
<xref ref-type="table" rid="T2">2</xref>
), i.e. 3,590 against 3,076 out of the 8,281 AOL taxon queries.</p>
</sec>
<sec>
<title>TCl-Db Provides Taxonomic Awareness for TreeBASE</title>
<p>The lack of taxonomic content in TreeBASE is responsible for poor data retrieval. Previous studies have also highlighted this. The taxon names in a 2004 snapshot of TreeBASE were mapped previously to the databases IPNI, ITIS, NCBI, and uBio in TBmap [
<xref ref-type="bibr" rid="B53">53</xref>
], and this work comments on the importance of internal consistency within a database system and the requirement for data validation. TCl-Db can also be used for this purpose and part of that analysis was replicated here in an
<italic>automated way</italic>
. Through SQL queries we mapped 28,876 TreeBASE taxa to taxa in TCl-Db. The distribution of TreeBASE names, grouped by taxonomic rank, is shown in Table
<xref ref-type="table" rid="T6">6</xref>
. This shows that the majority of TreeBASE names are species, while the majority of queries performed on TreeBASE concern higher taxa. It is not surprising, therefore, that data retrieval is poor. The lack of taxonomic support in TreeBASE means that queries return no data because the system does not understand the query terms. One way to improve this, as shown above, is to increase 'the vocabulary' of the database. Superimposing a taxonomy onto the TreeBASE structure ensures that queries are understood by the system and makes it significantly more user-friendly.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption>
<p>Proportion of Higher Taxa Queries within TreeBASE Query log. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td></td>
<td align="right">TreeBASE database</td>
<td align="right">TreeBASE Query Log</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Subspecies</td>
<td align="right">218</td>
<td align="right">145</td>
</tr>
<tr>
<td align="left">Species</td>
<td align="right">23,105</td>
<td align="right">7,781</td>
</tr>
<tr>
<td align="left">Higher Taxa</td>
<td align="right">5,086</td>
<td align="right">13,558</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>TreeBASE taxon content compared with the TreeBASE taxon query log. The difference between the distribution of taxon names in TreeBASE and in the TreeBASE query log is large. The vast majority of taxa in TreeBASE are species (left column), while most queries performed on TreeBASE concern higher taxa (right column).</p>
</table-wrap-foot>
</table-wrap>
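<p>The contrast in Table 6 is easier to see as percentages. The short sketch below derives them from the counts in the table; the percentages themselves are not stated in the original text, and the names TABLE_6 and rank_shares are illustrative only.</p>
<preformat><![CDATA[
```python
# Counts copied from Table 6; the percentage shares are derived here.
TABLE_6 = {
    "database": {"Subspecies": 218, "Species": 23105, "Higher Taxa": 5086},
    "query_log": {"Subspecies": 145, "Species": 7781, "Higher Taxa": 13558},
}


def rank_shares(counts):
    """Return each rank's share of the column total, in percent (one decimal)."""
    total = sum(counts.values())
    return {rank: round(100.0 * n / total, 1) for rank, n in counts.items()}


if __name__ == "__main__":
    for column, counts in TABLE_6.items():
        print(column, rank_shares(counts))
```
]]></preformat>
<p>On these counts, species make up roughly four fifths of the names in the database, whereas higher taxa account for well over half of the queries in the log.</p>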
<p>Although a number of integrated database systems already exist and store names from multiple sources, they do not store the classifications of those names, so a user cannot freely choose the classification that best suits their work. TCl-Db was developed because Sp2000 and uBio could not meet the requirements we gathered. The specific shortcoming of Sp2000 was that it did not support multiple classifications, while uBio could not effectively link to TreeBASE. However, uBio has since extended its services to include classifications [
<xref ref-type="bibr" rid="B54">54</xref>
], which are accessible only through a web service.</p>
<p>TCl-Db supports a number of
<italic>novel functions </italic>
not included within other systems. First, it performs hierarchical searches through a choice of three classifications. Providing a higher taxon name as a query returns names contained within the hierarchy. Second, it expands search terms that are synonyms or vernaculars to include the valid names associated with them. These queries are similar to 'drill down' browsing searches and 'fuzzy' queries using generalised terms. These queries are supported by a local copy of TreeBASE accessed through a web-based wrapper.</p>
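<p>A hierarchical search of this kind can be sketched as a traversal of a parent-to-children table within the chosen classification. The example below is a minimal sketch under stated assumptions: the CLASSIFICATIONS table holds a tiny hand-written fragment for illustration only, and subordinate_names is a hypothetical helper, not the SQL used by TCl-Db.</p>
<preformat><![CDATA[
```python
from collections import deque

# Tiny hand-written fragments for illustration; not real TCl-Db content.
CLASSIFICATIONS = {
    "NCBI": {
        "Aves": ["Procellariiformes"],
        "Procellariiformes": ["Procellariidae"],
        "Procellariidae": ["Puffinus"],
    },
    "ITIS": {
        "Aves": ["Procellariiformes"],
    },
}


def subordinate_names(taxon, classification):
    """Return the taxon and every name below it in the chosen classification."""
    children = CLASSIFICATIONS[classification]
    names = []
    seen = {taxon}
    queue = deque([taxon])
    while queue:
        current = queue.popleft()
        names.append(current)
        for child in children.get(current, []):
            if child not in seen:  # guard against repeated names
                seen.add(child)
                queue.append(child)
    return names


if __name__ == "__main__":
    print(subordinate_names("Aves", "NCBI"))
```
]]></preformat>
<p>Each name in the returned list would then be searched in TreeBASE, so a query on a higher taxon reaches trees that only mention its subordinate taxa. The same query can give different results under a different classification, which is why the choice is left to the user.</p>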
<p>The interface to TCl-Db provides both a search form (Figure
<xref ref-type="fig" rid="F3">3</xref>
) and a classification browse page (Figure
<xref ref-type="fig" rid="F4">4</xref>
) which returns either TreeBASE
<italic>treeids </italic>
or
<italic>studyids </italic>
which link to the current online TreeBASE interface via hyperlinks. The web interface enables the user to enter vernacular names as search terms. These searches return a list of linked taxon names from which the user can select. For example, entering the search term 'birds' will return a link to the term 'Aves'. The search form also accepts approximate spellings and offers a correction, as in Google's 'did you mean' link. For example, the search term 'Caenorabditis' returns no data but suggests 'Caenorhabditis' as an alternative. Hierarchical queries are also supported. Once a search term is entered, the system returns a list of classifications. Once a classification is selected, the query expands to subordinate terms within the classification and each term is searched through TreeBASE. Additionally, a browse function is supported. It allows users first to select which hierarchy they wish to browse (ITIS, NCBI or Sp2000), and then to select the taxon for which they want to retrieve data.</p>
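<p>The text does not say how the spelling suggestion is computed. One possible approach, shown here purely as an assumption, is approximate string matching against the list of known names using the Python standard library function difflib.get_close_matches. The name list KNOWN_NAMES, the helper suggest and the cutoff value are illustrative choices.</p>
<preformat><![CDATA[
```python
import difflib

# Illustrative name list; a real system would draw on its full set of taxon names.
KNOWN_NAMES = ["Caenorhabditis", "Streptococcus", "Lactobacillales", "Aves"]


def suggest(term, known=KNOWN_NAMES, cutoff=0.8):
    """Return a 'did you mean' candidate, or None if the term is known or nothing is close."""
    if term in known:
        return None  # exact hit, no suggestion needed
    matches = difflib.get_close_matches(term, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(suggest("Caenorabditis"))
```
]]></preformat>
<p>The cutoff trades recall against noise: a lower value suggests alternatives for more badly misspelled terms but also risks proposing unrelated names.</p>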
<fig position="float" id="F3">
<label>Figure 3</label>
<caption>
<p>
<bold>TreeBASE Wrapper – Search Page</bold>
. This page can be accessed from the URL
<ext-link ext-link-type="uri" xlink:href="http://spira.zoology.gla.ac.uk/app/tbase_wrapper.php"></ext-link>
. In response to the query 'Streptococcus', the TCl-Db wrapper returns three distinct taxa present in four trees (left pane). The right pane shows taxon details for 'Streptococcus' from the data sources included in TCl-Db.</p>
</caption>
<graphic xlink:href="1471-2148-9-93-3"></graphic>
</fig>
<fig position="float" id="F4">
<label>Figure 4</label>
<caption>
<p>
<bold>TreeBASE Wrapper – Browse Page</bold>
. This page can be accessed from the URL
<ext-link ext-link-type="uri" xlink:href="http://spira.zoology.gla.ac.uk/app/browse.php"></ext-link>
. The NCBI hierarchy is traversed to 'Lactobacillales', which returns 4 distinct trees (M1498, M2480, M2478 and M2476). The query is started by selecting the classification using the select boxes in the top left; the choices are ITIS, NCBI and SP2K. The hierarchy is traversed with a single mouse click through each level as it appears. A double click on a taxon name triggers a TCl-Db query through TreeBASE.</p>
</caption>
<graphic xlink:href="1471-2148-9-93-4"></graphic>
</fig>
</sec>
</sec>
</sec>
<sec>
<title>Discussion</title>
<p>The version of TreeBASE on which this analysis was based is to be replaced by CIPRES as TreeBASE2 [
<xref ref-type="bibr" rid="B23">23</xref>
]. Although a prototype was due for release in July 2006, it is not yet available. The new, improved TreeBASE schema has a Taxon module which aims to rectify many of the data retrieval issues currently experienced by users. It is difficult to see from the available documentation and schema exactly how hierarchical and vernacular queries will be supported in TreeBASE2. Until the system comes online, our web application makes clear the advantages of supporting taxon queries and the benefits of query expansion.</p>
<p>Phylofinder [
<xref ref-type="bibr" rid="B9">9</xref>
] also shows how data retrieval can be improved with the inclusion of a taxonomy. It uses the NCBI classification and makes use of TBmap [
<xref ref-type="bibr" rid="B53">53</xref>
] to deal with taxon names that are not included in NCBI. On the whole, Phylofinder does improve data retrieval; however, the inclusion of just one classification limits the higher taxon queries that can be performed to only those included in NCBI and TBmap. Table
<xref ref-type="table" rid="T7">7</xref>
shows a selection of higher taxa terms from the ITIS classification and shows that data retrieval in Phylofinder is still limited. For instance, the query 'Craspedomonadales' returns no hits in Phylofinder and 35 trees when TCl-Db is used, and 'Pinales', also with no hits in Phylofinder, returns 37 trees when routed via TCl-Db. This is partly because TBmap has a restricted scope: the mapping is based on a 2004 snapshot of TreeBASE, and the mappings are limited to the taxa contained in TreeBASE. As a result, many higher taxa queries are not well supported. Although TCl-Db uses a 2006 snapshot of TreeBASE, it is only marginally outperformed by Phylofinder, which uses a more recent version of TreeBASE. The queries 'Aves' and 'Puffinus', exemplified originally, return 1 more tree and 6 more trees respectively in Phylofinder. The inclusion of more than one classification scheme and the support for vernacular queries make the approach used by TCl-Db superior to that used by Phylofinder. Phylofinder is based on mappings that are already out of date, so its shelf life is limited, whereas TCl-Db performs mappings to TreeBASE automatically and will therefore be able to provide a more useful resource in the long term.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption>
<p>ITIS higher taxa queries in Phylofinder and TCl-Db. </p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td align="left">Query</td>
<td align="right">Phylofinder (trees found)</td>
<td align="right">TCl-Db (trees found)</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Aristolochiales</td>
<td align="right">0</td>
<td align="right">7</td>
</tr>
<tr>
<td align="left">Bromeliales</td>
<td align="right">1</td>
<td align="right">16</td>
</tr>
<tr>
<td align="left">Calycerales</td>
<td align="right">0</td>
<td align="right">4</td>
</tr>
<tr>
<td align="left">Schistostegiales</td>
<td align="right">0</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">Aulacoseirales</td>
<td align="right">0</td>
<td align="right">1</td>
</tr>
<tr>
<td align="left">Centrales</td>
<td align="right">0</td>
<td align="right">5</td>
</tr>
<tr>
<td align="left">Chromalinales</td>
<td align="right">0</td>
<td align="right">4</td>
</tr>
<tr>
<td align="left">
<bold>Craspedomonadales</bold>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>35</bold>
</td>
</tr>
<tr>
<td align="left">Leitneriales</td>
<td align="right">0</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">Lithodesmiales</td>
<td align="right">0</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">Plumbaginales</td>
<td align="right">0</td>
<td align="right">10</td>
</tr>
<tr>
<td align="left">Polygalales</td>
<td align="right">1</td>
<td align="right">14</td>
</tr>
<tr>
<td align="left">Hydrocharitales</td>
<td align="right">0</td>
<td align="right">3</td>
</tr>
<tr>
<td align="left">
<bold>Pinales</bold>
</td>
<td align="right">
<bold>0</bold>
</td>
<td align="right">
<bold>37</bold>
</td>
</tr>
<tr>
<td align="left">Eriocaulales</td>
<td align="right">0</td>
<td align="right">12</td>
</tr>
<tr>
<td align="left">Fissidentales</td>
<td align="right">0</td>
<td align="right">6</td>
</tr>
<tr>
<td align="left">Papaverales</td>
<td align="right">0</td>
<td align="right">8</td>
</tr>
<tr>
<td align="left">Cryptonemiales</td>
<td align="right">0</td>
<td align="right">2</td>
</tr>
<tr>
<td align="left">Biddulphiales</td>
<td align="right">1</td>
<td align="right">3</td>
</tr>
<tr>
<td align="left">Restionales</td>
<td align="right">0</td>
<td align="right">8</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>A sample of ITIS higher taxa in the Plant Kingdom with the number of trees returned in Phylofinder and TCl-Db. Terms Pinales and Craspedomonadales, in bold, return large numbers of hits in TCl-Db.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Future work</title>
<sec>
<title>Data Freshness</title>
<p>One of the challenges is data maintenance within TCl-Db. Even though the system was developed with TreeBASE as the primary source of phylogeny data, there may be other database systems that could benefit from the inclusion of a taxonomy. We need to keep data sets current for the system to be useful in the long term and to other consumers. Updates to the NCBI and ITIS classifications have so far been performed manually, and the process has highlighted maintenance issues that need to be addressed before updates can be automated. This is the focus of current work. At present, data updates are performed on request, and we endeavour to update the ITIS and NCBI data at least yearly. New or updated checklist data can also be added on request.</p>
</sec>
<sec>
<title>Semantic Web Technologies</title>
<p>The core of TCl-Db work is data integration. From the database perspective, data warehousing and data integration [
<xref ref-type="bibr" rid="B55">55</xref>
] involve gathering data from several silos and mapping them into a common schema. Integration is achieved by issuing queries on this common structure. On the web, however, data are not integrated physically but are linked using URLs, which provides a degree of flexibility as sources evolve. In the next generation of the web, resources, given correct meta-data [
<xref ref-type="bibr" rid="B56">56</xref>
], could be linked automatically via ontological annotations [
<xref ref-type="bibr" rid="B57">57</xref>
]. Semantically annotated data will have meaning to computers and not just to the users browsing them [
<xref ref-type="bibr" rid="B58">58</xref>
], which enables automatic data matching, integration and translation. Semantic web technologies should be able to support automated linking of phylogenetic and taxonomic resources [
<xref ref-type="bibr" rid="B59">59</xref>
]. Making taxonomic data interoperable [
<xref ref-type="bibr" rid="B60">60</xref>
] would be of great benefit, as it would remove the need for carefully orchestrated updates, which would be replaced by distributed web querying. Also, the distributed nature of systematics lends itself to the semantic web ethos. Potentially, semantic web technologies will reduce the need for data warehousing, and replace the centralised approach to data management with a distributed one [
<xref ref-type="bibr" rid="B61">61</xref>
]. The future development of TCl-Db will make use of semantic technologies for data integration and support greater interoperability of taxonomy and phylogeny systems.</p>
</sec>
</sec>
<sec>
<title>Conclusion</title>
<p>The lack of taxonomic intelligence in TreeBASE makes data retrieval ineffective in some cases. Our hypothesis that data retrieval can be improved through the inclusion of taxonomic meta-data is well substantiated. We clearly show that where TreeBASE finds little data, TCl-Db delivers improved results. TCl-Db provides an infrastructure supporting effective data retrieval within TreeBASE by using taxon names as search terms. The analysis we presented shows the importance of this meta-data in supporting queries found in query logs. Additionally, via the inclusion of vernaculars and synonyms, additional data can be found in TreeBASE. The use of an amalgamated taxonomy data warehouse also addresses the issues of taxonomic coverage and the differing opinions in taxonomy, and supports the comparison of taxonomy and data coverage in several contexts.</p>
</sec>
<sec>
<title>Availability and requirements</title>
<p>The wrapper which expands queries with information from TCl-Db can be accessed at the URL
<ext-link ext-link-type="uri" xlink:href="http://spira.bio.gla.ac.uk/app/tbasewrapper.php"></ext-link>
and has been tested on Mozilla Firefox version 2 and Safari version 3. Database dumps for Oracle and MySQL can be found at
<ext-link ext-link-type="uri" xlink:href="http://spira.zoology.gla.ac.uk/download.php"></ext-link>
.</p>
</sec>
<sec>
<title>Abbreviations</title>
<p>TCl-Db: Taxonomy and Classification Database; ITIS: Integrated Taxonomic Information System; Sp2000: Species 2000; NCBI: National Center for Biotechnology Information; uBio: Universal Biological Indexer and Organiser; AOL: America OnLine; CIPRES: CyberInfrastructure for Phylogenetic RESearch.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>NA was the primary designer and developer of TCl-Db, and wrote the paper. EH provided some suggestions, and edited parts of the manuscript. Both authors read and approved the final version.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional file 1</title>
<p>
<bold>SQL Queries</bold>
. The data provided represent example SQL queries for each of the hierarchical queries (Queries 1 – 3) and for expanding vernaculars to valid names (Query 4).</p>
</caption>
<media xlink:href="1471-2148-9-93-S1.pdf" mimetype="application" mime-subtype="pdf">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<sec>
<title>Acknowledgements</title>
<p>This project was funded by a University of Glasgow PhD scholarship supervised by Rod Page and Ela Hunt. Ela Hunt was funded by an MRC fellowship (2001–2005) and an EU Marie Curie fellowship (2006–2008). Bill Piel of Yale University provided the TreeBASE data dump and query log.</p>
</sec>
</ack>
<ref-list>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>DeSalle</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Giribet</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wheeler</surname>
<given-names>W</given-names>
</name>
</person-group>
<source>Techniques in Molecular Systematics and Evolution</source>
<year>2002</year>
<publisher-name>Basel: Birkhäuser</publisher-name>
</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Scotland</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Pennington</surname>
<given-names>T</given-names>
</name>
<collab>(Eds)</collab>
</person-group>
<source>Homology and Systematics: Coding Characters for Phylogenetic Analysis</source>
<year>2000</year>
<publisher-name>Systematics Association Special Volumes, London: Taylor &amp; Francis</publisher-name>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zusi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Wood</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Jenkinson</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Remarks on a World-Wide Inventory of Avian Anatomical Specimens</article-title>
<source>The Auk</source>
<year>1982</year>
<volume>99</volume>
<fpage>740</fpage>
<lpage>757</lpage>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wheeler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Barrett</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Benson</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Canese</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chetvernin</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>D</given-names>
</name>
<name>
<surname>DiCuccio</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Edgar</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Federhen</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Database resources of the National Center for Biotechnology Information</article-title>
<source>Nucleic Acids Research</source>
<year>2007</year>
<fpage>D5</fpage>
<pub-id pub-id-type="pmid">17170002</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cracraft</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Donoghue</surname>
<given-names>M</given-names>
</name>
</person-group>
<source>Assembling the tree of life</source>
<year>2004</year>
<publisher-name>New York: Oxford University Press</publisher-name>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jensen</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Saric</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Literature mining for the biologist: from information retrieval to biological discovery</article-title>
<source>Nature Reviews Genetics</source>
<year>2006</year>
<volume>7</volume>
<fpage>119</fpage>
</citation>
</ref>
<ref id="B7">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Nakhleh</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Miranker</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Barbançon</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Requirements of phylogenetic databases</article-title>
<source>Bioinformatics and Bioengineering, 2003 Proceedings Third IEEE Symposium on</source>
<year>2003</year>
<fpage>141</fpage>
<lpage>148</lpage>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morell</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>TreeBASE: the roots of phylogeny</article-title>
<source>Science</source>
<year>1996</year>
<volume>273</volume>
<fpage>569</fpage>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Burleigh</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bansal</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fernandez-Baca</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>PhyloFinder: an intelligent search engine for phylogenetic tree databases</article-title>
<source>BMC Evolutionary Biology</source>
<year>2008</year>
<volume>8</volume>
<fpage>90</fpage>
<pub-id pub-id-type="pmid">18366717</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jeffrey</surname>
<given-names>C</given-names>
</name>
</person-group>
<source>Biological nomenclature</source>
<year>1989</year>
<publisher-name>London: Edward Arnold</publisher-name>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mayr</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Two empires or three?</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>1998</year>
<volume>95</volume>
<fpage>9720</fpage>
<lpage>9723</lpage>
<pub-id pub-id-type="pmid">9707542</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cain</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Logic and memory in Linnaeus's system of taxonomy</article-title>
<source>Proceedings of the Linnean Society of London</source>
<year>1958</year>
<volume>169</volume>
<fpage>144</fpage>
<lpage>163</lpage>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soberón</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Peterson</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Biodiversity informatics: managing and applying primary biodiversity data</article-title>
<source>Philosophical Transactions: Biological Sciences</source>
<year>2004</year>
<volume>359</volume>
<fpage>689</fpage>
<lpage>698</lpage>
<pub-id pub-id-type="pmid">15253354</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scoble</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Unitary or unified taxonomy?</article-title>
<source>Philosophical Transactions: Biological Sciences</source>
<year>2004</year>
<volume>359</volume>
<fpage>699</fpage>
<lpage>710</lpage>
<pub-id pub-id-type="pmid">15253355</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kennedy</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Supporting Taxonomic Names in Cell and Molecular Biology Databases</article-title>
<source>Omics A Journal of Integrative Biology</source>
<year>2003</year>
<volume>7</volume>
<fpage>13</fpage>
<lpage>16</lpage>
<pub-id pub-id-type="pmid">12831547</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saarenmaa</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>The Global Biodiversity Information Facility: Architectural and implementation issues</article-title>
<source>European Environment Agency, Technical Reports</source>
<year>1999</year>
<volume>34</volume>
<fpage>34</fpage>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wilson</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>The encyclopedia of life</article-title>
<source>Trends in Ecology and Evolution</source>
<year>2003</year>
<volume>18</volume>
<fpage>77</fpage>
<lpage>80</lpage>
</citation>
</ref>
<ref id="B18">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>of Biological Sciences IU</collab>
</person-group>
<article-title>TDWG – Taxonomic Databases Working Group</article-title>
<year>2006</year>
<ext-link ext-link-type="uri" xlink:href="http://www.tdwg.org/"></ext-link>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thiele</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Yeates</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Tension arises from duality at the heart of taxonomy</article-title>
<source>Nature</source>
<year>2002</year>
<volume>419</volume>
<fpage>337</fpage>
<pub-id pub-id-type="pmid">12353005</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hedges</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sibley</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Molecule vs. Morphology in Avian Evolution: The Case of the "Pelecaniform" Birds</article-title>
<source>PNAS</source>
<year>1994</year>
<volume>91</volume>
<fpage>9861</fpage>
<lpage>9865</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.pnas.org/cgi/content/abstract/91/21/9861"></ext-link>
<pub-id pub-id-type="pmid">7937906</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Coues</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Critical review of the Family Procellariidae. Part V. Embracing the Diomedeinae and the Halodrominae</article-title>
<source>Proceedings of the Academy of Natural Sciences of Philadelphia</source>
<year>1866</year>
<volume>18</volume>
<fpage>172</fpage>
<lpage>197</lpage>
</citation>
</ref>
<ref id="B22">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>ITIS</collab>
</person-group>
<article-title>Integrated Taxonomic Information System</article-title>
<year>2006</year>
<ext-link ext-link-type="uri" xlink:href="http://www.itis.gov"></ext-link>
</citation>
</ref>
<ref id="B23">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>CIPRES</collab>
</person-group>
<source>Cyberinfrastructure for phylogenetic research</source>
<year>2006</year>
<ext-link ext-link-type="uri" xlink:href="http://www.phylo.org/sub_sections/databases.php"></ext-link>
</citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Page</surname>
<given-names>R</given-names>
</name>
</person-group>
<source>Tangled Trees: Phylogeny, Cospeciation, and Coevolution</source>
<year>2002</year>
<publisher-name>London: University of Chicago Press</publisher-name>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gaston</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Biodiversity: higher taxon richness</article-title>
<source>Progress in Physical Geography</source>
<year>2000</year>
<volume>24</volume>
<fpage>117</fpage>
<lpage>127</lpage>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hafner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Page</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Molecular Phylogenies and Host-Parasite Cospeciation: Gophers and Lice as a Model System</article-title>
<source>Philosophical Transactions: Biological Sciences</source>
<year>1995</year>
<volume>349</volume>
<fpage>77</fpage>
<lpage>83</lpage>
<pub-id pub-id-type="pmid">8748020</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nunn</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Altizer</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Sechrest</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Comparative Tests of Parasite Species Richness in Primates</article-title>
<source>American Naturalist</source>
<year>2003</year>
<volume>162</volume>
<fpage>597</fpage>
<lpage>614</lpage>
<pub-id pub-id-type="pmid">14618538</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bininda-Emonds</surname>
<given-names>O</given-names>
</name>
<collab>(Ed)</collab>
</person-group>
<source>Phylogenetic Supertrees: Combining Information To Reveal The Tree Of Life</source>
<year>2004</year>
<publisher-name>Dordrecht: Kluwer Academic Publishers</publisher-name>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bininda-Emonds</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>The evolution of supertrees</article-title>
<source>Trends in Ecology and Evolution</source>
<year>2004</year>
<volume>19</volume>
<fpage>315</fpage>
<lpage>322</lpage>
<pub-id pub-id-type="pmid">16701277</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beck</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bininda-Emonds</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Cardillo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Purvis</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>A higher-level MRP supertree of placental mammals</article-title>
<source>BMC Evolutionary Biology</source>
<year>2006</year>
<volume>6</volume>
<fpage>93</fpage>
<pub-id pub-id-type="pmid">17101039</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wilson</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Reeder</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>Mammal Species of the World: A Taxonomic and Geographic Reference</source>
<year>1993</year>
<publisher-name>Baltimore: The Johns Hopkins University Press</publisher-name>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thomas</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wills</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Székely</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>A supertree approach to shorebird phylogeny</article-title>
<source>BMC Evolutionary Biology</source>
<year>2004</year>
<volume>4</volume>
<fpage>28</fpage>
<pub-id pub-id-type="pmid">15329156</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Monroe</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Sibley</surname>
<given-names>C</given-names>
</name>
</person-group>
<source>A World Checklist of Birds</source>
<year>1997</year>
<publisher-name>London: Yale University Press</publisher-name>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Garrity</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Lyons</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Future-Proofing Biological Nomenclature</article-title>
<source>Omics A Journal of Integrative Biology</source>
<year>2003</year>
<volume>7</volume>
<fpage>31</fpage>
<lpage>33</lpage>
<pub-id pub-id-type="pmid">12831553</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Knapp</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>What's in a name?</article-title>
<source>Nature</source>
<year>2000</year>
<volume>408</volume>
<fpage>33</fpage>
<pub-id pub-id-type="pmid">11081489</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Petsko</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>What's in a name?</article-title>
<source>Genome Biol</source>
<year>2002</year>
<volume>3</volume>
<fpage>1</fpage>
<lpage>1005</lpage>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Federhen</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The National Center for Biotechnology Information (NCBI) Taxonomy Database</article-title>
<year>2005</year>
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html"></ext-link>
<pub-id pub-id-type="pmid">15608222</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Bisby</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Species 2000: indexing the world's known species</article-title>
<year>2000</year>
<ext-link ext-link-type="uri" xlink:href="http://www.Species2000.org"></ext-link>
</citation>
</ref>
<ref id="B39">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>AOU</collab>
</person-group>
<article-title>American Ornithologists' Union, Check-list of North American Birds</article-title>
<source>Amer Ornithol Union, Lawrence, Kans</source>
<year>1983</year>
<fpage>877</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.aou.org/checklist/north/"></ext-link>
</citation>
</ref>
<ref id="B40">
<citation citation-type="book">
<person-group person-group-type="author">
<collab>AOU</collab>
</person-group>
<article-title>American Ornithologists' Union, Check-list of North American Birds</article-title>
<source>Amer Ornithol Union</source>
<year>1998</year>
<edition>7</edition>
<publisher-name>Lawrence, Kansas: AOU Press</publisher-name>
<fpage>829</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.aou.org/checklist/north/"></ext-link>
</citation>
</ref>
<ref id="B41">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Morony</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Bock</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Farrand</surname>
<given-names>J</given-names>
</name>
</person-group>
<source>Reference list to the birds of the world-American Museum of Natural History</source>
<year>1975</year>
</citation>
</ref>
<ref id="B42">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Peters</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Checklist of birds of the world</article-title>
<year>1987</year>
<ext-link ext-link-type="uri" xlink:href="http://worldbirdinfo.net/Pages/PetersFamilyList.aspx"></ext-link>
</citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Clements</surname>
<given-names>J</given-names>
</name>
</person-group>
<source>Birds of the world: A checklist</source>
<year>2000</year>
<volume>270</volume>
<publisher-name>Ibis Publishing Company. Vista, California</publisher-name>
</citation>
</ref>
<ref id="B44">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Hackett</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Assembling the tree of life: Early Bird</article-title>
<year>2003</year>
<ext-link ext-link-type="uri" xlink:href="http://www.fieldmuseum.org/research_collections/zoology/zoo_sites/early_bird/"></ext-link>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Perry</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Stoner</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mowder</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Plant germplasm information management system: germplasm resources information network</article-title>
<source>HortScience</source>
<year>1988</year>
<volume>23</volume>
<fpage>57</fpage>
<lpage>60</lpage>
</citation>
</ref>
<ref id="B46">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>Oracle</collab>
</person-group>
<article-title>Oracle 10g: Database</article-title>
<source>Oracle Corporation, Redwood Shores, CA</source>
<year>2006</year>
<ext-link ext-link-type="uri" xlink:href="http://www.oracle.com"></ext-link>
</citation>
</ref>
<ref id="B47">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Buneman</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Khanna</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Data Provenance: Some Basic Issues</article-title>
<source>FST TCS 2000: Foundations of Software Technology and Theoretical Computer Science: 20th Conference, New Delhi, India, December 13–15, 2000: Proceedings</source>
<year>2000</year>
</citation>
</ref>
<ref id="B48">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Buneman</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>WC</given-names>
</name>
</person-group>
<article-title>Provenance in databases</article-title>
<source>SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data</source>
<year>2007</year>
<publisher-name>New York, NY, USA: ACM</publisher-name>
<fpage>1171</fpage>
<lpage>1173</lpage>
</citation>
</ref>
<ref id="B49">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Celko</surname>
<given-names>J</given-names>
</name>
</person-group>
<source>Joe Celko's SQL for Smarties: Advanced SQL Programming</source>
<year>1999</year>
<publisher-name>San Francisco: Morgan Kaufmann</publisher-name>
</citation>
</ref>
<ref id="B50">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Tropashko</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Trees in SQL: Nested Sets and Materialized Path</article-title>
<year>2002</year>
<ext-link ext-link-type="uri" xlink:href="http://www.dbazine.com/tropashko4.shtml"></ext-link>
</citation>
</ref>
<ref id="B51">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gennick</surname>
<given-names>J</given-names>
</name>
</person-group>
<source>SQL Pocket Guide, Hierarchical Queries</source>
<year>2006</year>
<publisher-name>Sebastopol: O'Reilly</publisher-name>
<fpage>66</fpage>
<lpage>72</lpage>
</citation>
</ref>
<ref id="B52">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>America Online Inc AOI</collab>
</person-group>
<article-title>AOL Search log from 500000 users</article-title>
<year>2006</year>
<ext-link ext-link-type="uri" xlink:href="http://www.gregsadetsky.com/aol-data/"></ext-link>
</citation>
</ref>
<ref id="B53">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Page</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>TBMap: A taxonomic perspective on the phylogenetic database TreeBASE</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<pub-id pub-id-type="pmid">17511869</pub-id>
</citation>
</ref>
<ref id="B54">
<citation citation-type="other">
<person-group person-group-type="author">
<collab>uBio</collab>
</person-group>
<article-title>Universal Biological Indexer and Organizer</article-title>
<year>2006</year>
<ext-link ext-link-type="uri" xlink:href="http://www.ubio.org"></ext-link>
</citation>
</ref>
<ref id="B55">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lacroix</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Critchlow</surname>
<given-names>T</given-names>
</name>
</person-group>
<source>Bioinformatics: managing scientific data</source>
<year>2003</year>
<publisher-name>San Francisco: Morgan Kaufmann</publisher-name>
</citation>
</ref>
<ref id="B56">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Resnik</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
<source>Proceedings of the 14th International Joint Conference on Artificial Intelligence</source>
<year>1995</year>
<volume>1</volume>
<fpage>448</fpage>
<lpage>453</lpage>
</citation>
</ref>
<ref id="B57">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cannata</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Schröder</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Marangoni</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Romano</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>A Semantic Web for bioinformatics: goals, tools, systems, applications</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>s1</fpage>
<pub-id pub-id-type="pmid">18460170</pub-id>
</citation>
</ref>
<ref id="B58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Amann</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Fundulaki</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Scholl</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Integrating ontologies and thesauri for RDF schema creation and metadata querying</article-title>
<source>International Journal on Digital Libraries</source>
<year>2000</year>
<volume>3</volume>
<fpage>221</fpage>
<lpage>236</lpage>
</citation>
</ref>
<ref id="B59">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Gorlitsky</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>From XML to RDF: how semantic web technologies will change the design of 'omic' standards</article-title>
<source>Nature Biotechnology</source>
<year>2005</year>
<volume>23</volume>
<fpage>1099</fpage>
<lpage>1103</lpage>
<pub-id pub-id-type="pmid">16151403</pub-id>
</citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Williams</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>How to get databases talking the same language</article-title>
<source>Science</source>
<year>1997</year>
<volume>275</volume>
<fpage>301</fpage>
<lpage>302</lpage>
<pub-id pub-id-type="pmid">9005552</pub-id>
</citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stein</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Creating a bioinformatics nation</article-title>
<source>Nature</source>
<year>2002</year>
<volume>417</volume>
<fpage>119</fpage>
<lpage>120</lpage>
<pub-id pub-id-type="pmid">12000935</pub-id>
</citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000216  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000216  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024