Exploration server for computer science research in Lorraine


Ontology-guided data preparation for discovering genotype-phenotype relationships

Internal identifier: 000022 (Pmc/Corpus); previous: 000021; next: 000023


Authors: Adrien Coulet; Malika Smaïl-Tabbone; Pascale Benlian; Amedeo Napoli; Marie-Dominique Devignes


RBID : PMC:2367630

Abstract

Background

The complexity and volume of post-genomic data are two major factors limiting the application of Knowledge Discovery in Databases (KDD) methods in life sciences. Bio-ontologies can now play key roles in knowledge discovery in the life sciences by providing semantics to data and to extracted units, taking advantage of progress in Semantic Web technologies, in particular the better understanding and availability of tools for knowledge representation, extraction, and reasoning.

Results

This paper presents a method that exploits bio-ontologies to guide data selection within the preparation step of the KDD process. We propose three scenarios in which domain knowledge and ontology elements such as subsumption, properties, and class descriptions are taken into account for data selection, before the data mining step. Each of these scenarios is illustrated with a case study on the search for genotype-phenotype relationships in a familial hypercholesterolemia dataset. The guidance of data selection by domain knowledge is analysed and shows a direct influence on the volume and significance of the data mining results.

Conclusions

The method proposed in this paper is an efficient alternative to numerical methods for data selection based on domain knowledge. In turn, the results of this study may be reused in ontology modelling and data integration.


DOI: 10.1186/1471-2105-9-S4-S3
PubMed: 18460176
PubMed Central: 2367630

Links to Exploration step

PMC:2367630

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Ontology-guided data preparation for discovering genotype-phenotype relationships</title>
<author>
<name sortKey="Coulet, Adrien" sort="Coulet, Adrien" uniqKey="Coulet A" first="Adrien" last="Coulet">Adrien Coulet</name>
<affiliation>
<nlm:aff id="I1">KIKA Medical, Paris, F-75012, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Smail Tabbone, Malika" sort="Smail Tabbone, Malika" uniqKey="Smail Tabbone M" first="Malika" last="Smaïl-Tabbone">Malika Smaïl-Tabbone</name>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Benlian, Pascale" sort="Benlian, Pascale" uniqKey="Benlian P" first="Pascale" last="Benlian">Pascale Benlian</name>
<affiliation>
<nlm:aff id="I3">Université Pierre et Marie Curie - Paris6, INSERM UMRS 538 Biochimie-Biologie Moléculaire, Paris, F-75571, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Napoli, Amedeo" sort="Napoli, Amedeo" uniqKey="Napoli A" first="Amedeo" last="Napoli">Amedeo Napoli</name>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Devignes, Marie Dominique" sort="Devignes, Marie Dominique" uniqKey="Devignes M" first="Marie-Dominique" last="Devignes">Marie-Dominique Devignes</name>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">18460176</idno>
<idno type="pmc">2367630</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2367630</idno>
<idno type="RBID">PMC:2367630</idno>
<idno type="doi">10.1186/1471-2105-9-S4-S3</idno>
<date when="2008">2008</date>
<idno type="wicri:Area/Pmc/Corpus">000022</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000022</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Ontology-guided data preparation for discovering genotype-phenotype relationships</title>
<author>
<name sortKey="Coulet, Adrien" sort="Coulet, Adrien" uniqKey="Coulet A" first="Adrien" last="Coulet">Adrien Coulet</name>
<affiliation>
<nlm:aff id="I1">KIKA Medical, Paris, F-75012, France</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Smail Tabbone, Malika" sort="Smail Tabbone, Malika" uniqKey="Smail Tabbone M" first="Malika" last="Smaïl-Tabbone">Malika Smaïl-Tabbone</name>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Benlian, Pascale" sort="Benlian, Pascale" uniqKey="Benlian P" first="Pascale" last="Benlian">Pascale Benlian</name>
<affiliation>
<nlm:aff id="I3">Université Pierre et Marie Curie - Paris6, INSERM UMRS 538 Biochimie-Biologie Moléculaire, Paris, F-75571, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Napoli, Amedeo" sort="Napoli, Amedeo" uniqKey="Napoli A" first="Amedeo" last="Napoli">Amedeo Napoli</name>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Devignes, Marie Dominique" sort="Devignes, Marie Dominique" uniqKey="Devignes M" first="Marie-Dominique" last="Devignes">Marie-Dominique Devignes</name>
<affiliation>
<nlm:aff id="I2">LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The complexity and volume of post-genomic data are two major factors limiting the application of Knowledge Discovery in Databases (KDD) methods in life sciences. Bio-ontologies can now play key roles in knowledge discovery in the life sciences by providing semantics to data and to extracted units, taking advantage of progress in Semantic Web technologies, in particular the better understanding and availability of tools for knowledge representation, extraction, and reasoning.</p>
</sec>
<sec>
<title>Results</title>
<p>This paper presents a method that exploits bio-ontologies to guide data selection within the preparation step of the KDD process. We propose three scenarios in which domain knowledge and ontology elements such as subsumption, properties, and class descriptions are taken into account for data selection, before the data mining step. Each of these scenarios is illustrated with a case study on the search for genotype-phenotype relationships in a familial hypercholesterolemia dataset. The guidance of data selection by domain knowledge is analysed and shows a direct influence on the volume and significance of the data mining results.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The method proposed in this paper is an efficient alternative to numerical methods for data selection based on domain knowledge. In turn, the results of this study may be reused in ontology modelling and data integration.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title>BMC Bioinformatics</journal-title>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">18460176</article-id>
<article-id pub-id-type="pmc">2367630</article-id>
<article-id pub-id-type="publisher-id">1471-2105-9-S4-S3</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-9-S4-S3</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Ontology-guided data preparation for discovering genotype-phenotype relationships</article-title>
</title-group>
<contrib-group>
<contrib id="A1" corresp="yes" contrib-type="author">
<name>
<surname>Coulet</surname>
<given-names>Adrien</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>adrien.coulet@loria.fr</email>
</contrib>
<contrib id="A2" contrib-type="author">
<name>
<surname>Smaïl-Tabbone</surname>
<given-names>Malika</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>malika.smail@loria.fr</email>
</contrib>
<contrib id="A3" contrib-type="author">
<name>
<surname>Benlian</surname>
<given-names>Pascale</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>pascale.benlian@sat.ap-hop-paris.fr</email>
</contrib>
<contrib id="A4" contrib-type="author">
<name>
<surname>Napoli</surname>
<given-names>Amedeo</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>amedeo.napoli@loria.fr</email>
</contrib>
<contrib id="A5" contrib-type="author">
<name>
<surname>Devignes</surname>
<given-names>Marie-Dominique</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>marie-dominique.devignes@loria.fr</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
KIKA Medical, Paris, F-75012, France</aff>
<aff id="I2">
<label>2</label>
LORIA (UMR 7503 CNRS-INPL-INRIA-Nancy2-UHP), Vandoeuvre-lès-Nancy, F- 54506, France</aff>
<aff id="I3">
<label>3</label>
Université Pierre et Marie Curie - Paris6, INSERM UMRS 538 Biochimie-Biologie Moléculaire, Paris, F-75571, France</aff>
<pub-date pub-type="collection">
<year>2008</year>
</pub-date>
<pub-date pub-type="epub">
<day>25</day>
<month>4</month>
<year>2008</year>
</pub-date>
<volume>9</volume>
<issue>Suppl 4</issue>
<supplement>
<named-content content-type="supplement-title">A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications</named-content>
<named-content content-type="supplement-editor">Paolo Romano, Michael Schroeder, Nicola Cannata and Roberto Marangoni</named-content>
</supplement>
<fpage>S3</fpage>
<lpage>S3</lpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/9/S4/S3"></ext-link>
<permissions>
<copyright-statement>Copyright © 2008 Coulet et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2008</copyright-year>
<copyright-holder>Coulet et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0"></ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</p>
<pmc-comment> Coulet Adrien adrien.coulet@loria.fr Ontology-guided data preparation for discovering genotype-phenotype relationships 2008BMC Bioinformatics 9(Suppl 4): S3-. (2008)1471-2105(2008)9:Suppl 4urn:ISSN:1471-2105</pmc-comment>
</license>
</permissions>
<abstract>
<sec>
<title>Background</title>
<p>The complexity and volume of post-genomic data are two major factors limiting the application of Knowledge Discovery in Databases (KDD) methods in life sciences. Bio-ontologies can now play key roles in knowledge discovery in the life sciences by providing semantics to data and to extracted units, taking advantage of progress in Semantic Web technologies, in particular the better understanding and availability of tools for knowledge representation, extraction, and reasoning.</p>
</sec>
<sec>
<title>Results</title>
<p>This paper presents a method that exploits bio-ontologies to guide data selection within the preparation step of the KDD process. We propose three scenarios in which domain knowledge and ontology elements such as subsumption, properties, and class descriptions are taken into account for data selection, before the data mining step. Each of these scenarios is illustrated with a case study on the search for genotype-phenotype relationships in a familial hypercholesterolemia dataset. The guidance of data selection by domain knowledge is analysed and shows a direct influence on the volume and significance of the data mining results.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The method proposed in this paper is an efficient alternative to numerical methods for data selection based on domain knowledge. In turn, the results of this study may be reused in ontology modelling and data integration.</p>
</sec>
</abstract>
<conference>
<conf-date>12–15 June 2007</conf-date>
<conf-name>Seventh International Workshop on Network Tools and Applications in Biology (NETTAB 2007)</conf-name>
<conf-loc>Pisa, Italy</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>The Knowledge Discovery in Databases (KDD) process is based on three main operations: data preparation, data mining, and interpretation of the extracted units. This process is guided and controlled by an expert of the concerned domain. The KDD process has been successfully applied in various domains such as marketing, finance, and biomedicine [
<xref ref-type="bibr" rid="B1">1</xref>
].</p>
<p>However, applications of KDD are limited by the need for strong interactions between the system and domain experts. Data manipulated in life sciences are complex, and data mining algorithms generate large volumes of raw results. As a consequence, the interpretation step of KDD in biology, aimed at extracting new and relevant knowledge units, is a hard task, i.e. time-consuming and tedious for the domain expert.</p>
<p>In computer science, ontologies provide a shared understanding of knowledge about a particular domain [
<xref ref-type="bibr" rid="B2">2</xref>
]. Bio-ontologies are becoming more and more available and contribute to the understanding of the large amounts of data existing in life sciences [
<xref ref-type="bibr" rid="B3">3</xref>
]. The National Center for Biomedical Ontology (NCBO) has recently developed Bioportal that offers a unified panorama on available bio-ontologies [
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B5">5</xref>
].</p>
<p>One promising use of bio-ontologies is to guide the KDD process, as suggested by Anand [
<xref ref-type="bibr" rid="B6">6</xref>
], Cespivova [
<xref ref-type="bibr" rid="B7">7</xref>
], Gottgtroy [
<xref ref-type="bibr" rid="B8">8</xref>
], and Napoli [
<xref ref-type="bibr" rid="B9">9</xref>
]. This idea seems to be much more realistic now that Semantic Web advances have given rise to common standards and technologies for expressing and sharing ontologies [
<xref ref-type="bibr" rid="B10">10</xref>
].</p>
<p>In this way, the three main operations of KDD can take advantage of domain knowledge embedded in bio-ontologies.</p>
<p>(1) During the data preparation step, bio-ontologies can facilitate the integration of heterogeneous data and guide the selection of relevant data to be mined.</p>
<p>(2) During the mining step, domain knowledge allows the specification of constraints for guiding data mining algorithms by, e.g. narrowing the search space.</p>
<p>(3) During the interpretation step, domain knowledge helps experts to visualize and validate extracted units.</p>
<p>A number of studies address the use of ontologies within the data mining step, e.g. [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
], and the interpretation step e.g. [
<xref ref-type="bibr" rid="B13">13</xref>
-
<xref ref-type="bibr" rid="B15">15</xref>
]. Only a few studies (detailed hereafter) have focused on the first step, namely data preparation. This step is the focus of the present paper.</p>
<p>Data preparation –or preprocessing– is aimed at improving the quality of the data, and consequently the efficiency of the KDD process. Methods for data preparation involve operations of different types: data integration, data cleaning, data transformation and data reduction [
<xref ref-type="bibr" rid="B16">16</xref>
]. These operations are not exclusive since they may be combined. For example, data transformation can have an impact on data cleaning during normalisation of data. Data integration can have an impact on data cleaning as well, when inconsistencies are detected and corrected, or when missing values are filled. Still regarding data integration, the use of ontologies has been theoretically and practically studied in life sciences [
<xref ref-type="bibr" rid="B17">17</xref>
,
<xref ref-type="bibr" rid="B18">18</xref>
]. In this way, we have defined and used an ontology for integrating data on genetic variants [
<xref ref-type="bibr" rid="B19">19</xref>
]. Perez-Rey
<italic>et al</italic>
. have developed OntoDataClean, an ontology-based tool aimed at resolving inconsistencies and handling missing and wrong values in datasets [
<xref ref-type="bibr" rid="B20">20</xref>
]. Data transformation operations produce formatted data, i.e. normalised and smoothed data, ready to be processed by data mining algorithms. Euler and Scholz propose a dedicated ontology related to the transformation process [
<xref ref-type="bibr" rid="B21">21</xref>
]. This ontology provides facilities to manipulate data by using conceptualization of the transformation process.</p>
<p>The role of the data reduction process is to reduce the description of data, e.g. by lowering the number of dimensions within the data, without altering the integrity of the initial data set. Strategies for data reduction include the following.</p>
<p>
<bold>Data cube aggregation</bold>
produces data cubes for storing multidimensional aggregated data (e.g. extracted from a data warehouse) for OLAP analysis [
<xref ref-type="bibr" rid="B22">22</xref>
]. For example, daily sales data covering millions of items can be aggregated into monthly sales for selected categories of items.</p>
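<p>The aggregation of daily sales into monthly sales can be sketched as follows. This is only a minimal illustration: the records, the category names and the function <italic>aggregate_monthly</italic> are hypothetical and do not come from the cited work.</p>

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales records: (day, item category, amount).
daily_sales = [
    (date(2008, 1, 3), "books", 120.0),
    (date(2008, 1, 17), "books", 80.0),
    (date(2008, 1, 20), "music", 50.0),
    (date(2008, 2, 2), "books", 30.0),
]


def aggregate_monthly(records, categories):
    """Sum daily amounts into (year, month, category) cells for the selected categories."""
    cube = defaultdict(float)
    for day, category, amount in records:
        if category in categories:
            cube[(day.year, day.month, category)] += amount
    return dict(cube)


# Keep only the "books" category and roll days up to months.
monthly = aggregate_monthly(daily_sales, {"books"})
print(monthly)
```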
<p>
<bold>Dimension reduction</bold>
leads to the encoding of data in a reduced format, with or without loss with respect to the initial data set. For example, principal component analysis reduces dimensionality by projecting the initial data onto a space of smaller dimension.</p>
<p>
<bold>Data discretization</bold>
techniques are used to reduce the number of values of an attribute and consequently facilitate interpretation of mining results. Automatic discretization methods exist for continuous numerical attributes that recursively partition the attribute values according to a given scale. For example, the range of an attribute
<italic>price</italic>
can be divided by means of histogram analysis into several intervals, which can in turn be iteratively aggregated into larger intervals. However, these methods do not apply to discrete or nominal attributes whose values are not ordered. The scale for such an attribute then has to be manually defined by domain experts and possibly refined with the help of heuristic methods [
<xref ref-type="bibr" rid="B23">23</xref>
].</p>
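<p>The interval-based discretization of a numerical attribute such as price can be sketched as follows. Equal-width binning is used here as a simple stand-in for histogram analysis, and the pairwise merging rule is an assumption chosen for illustration; the values and function names are hypothetical.</p>

```python
def equal_width_bins(values, n_bins):
    """Return n_bins (low, high) intervals of equal width covering the values."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [(low + i * width, low + (i + 1) * width) for i in range(n_bins)]


def merge_adjacent(bins):
    """Aggregate neighbouring intervals pairwise into larger intervals."""
    last = len(bins) - 1
    return [(bins[i][0], bins[min(i + 1, last)][1]) for i in range(0, len(bins), 2)]


# Hypothetical prices.
prices = [0.0, 5.0, 12.0, 18.0, 25.0, 33.0, 40.0]
fine = equal_width_bins(prices, 4)   # four intervals of width 10
coarse = merge_adjacent(fine)        # two intervals of width 20
print(fine)
print(coarse)
```

<p>Applying the merging step repeatedly yields progressively coarser scales, which is one way of reading the iterative aggregation described above.</p>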
<p>
<bold>Data selection</bold>
aims at identifying appropriate subsets among the initial set of attributes. This operation can be performed with the help of heuristic methods based on tests of significance or entropy-based attribute evaluation measures such as the information gain [
<xref ref-type="bibr" rid="B24">24</xref>
,
<xref ref-type="bibr" rid="B25">25</xref>
]. Data selection is the data reduction method studied in this paper.</p>
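<p>As an illustration of entropy-based attribute evaluation, the following sketch ranks attributes by information gain with respect to a target attribute. The rows and the attribute names (<italic>variant_a</italic>, <italic>variant_b</italic>, <italic>phenotype</italic>) are hypothetical toy data, not taken from the FH dataset.</p>

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on the attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy rows.
rows = [
    {"variant_a": "present", "variant_b": "present", "phenotype": "high"},
    {"variant_a": "present", "variant_b": "absent", "phenotype": "high"},
    {"variant_a": "absent", "variant_b": "present", "phenotype": "normal"},
    {"variant_a": "absent", "variant_b": "absent", "phenotype": "normal"},
]

# Attributes sorted from most to least informative about the phenotype.
ranked = sorted(
    ["variant_a", "variant_b"],
    key=lambda a: information_gain(rows, a, "phenotype"),
    reverse=True,
)
print(ranked)
```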
<p>The use of domain knowledge in the KDD process can be considered from two points of view. The first uses knowledge about the KDD process itself, i.e. the domains represented within ontologies are data transformation, data cleaning, or the whole KDD domain [
<xref ref-type="bibr" rid="B26">26</xref>
]. The second one uses knowledge related to the dataset domain [
<xref ref-type="bibr" rid="B18">18</xref>
], e.g. pharmacogenomics. The work presented in this article follows the second view and focuses on data preparation, more precisely on data selection. In addition, we specify how available domain knowledge, contained in a knowledge base (KB), can assist the domain expert in selecting relevant attributes or object subsets.</p>
<p>Our case study deals with genotype-phenotype relationships. Finding relationships between genotype and phenotype is of primary interest in biological research. Large-scale clinical studies provide large amounts of genomic and post-genomic data produced by high-throughput biotechnology devices (e.g. microarray, mass spectrometry). Recent studies [
<xref ref-type="bibr" rid="B27">27</xref>
-
<xref ref-type="bibr" rid="B29">29</xref>
] have shown that data mining methods can be used for extracting unexpected and hidden correlations between genotype and phenotype. However, these studies also illustrate the difficulty of carrying out such analyses, mainly because of the complexity of the domain and the large volume of data to be analysed. With this in mind, we illustrate here the benefits of using an ontology for data selection within a KDD process whose objective is to extract relationships between genomic variants and phenotype traits. The data sources explored in the experiment described in this paper have two origins: (i) private datasets resulting from clinical investigations of Familial Hypercholesterolemia (FH), and (ii) public databases (i.e. dbSNP, HapMap, OMIM, and Locus Specific Databases) partially integrated within SNP-KB, a knowledge base developed in our laboratory. One example of a relationship of interest concerns modulator variants, i.e. any genomic variant (or group of variants) related to the modulation of a disease or disease symptom. For example, various levels of severity are observed in FH depending on the alleles of two genomic variants in the
<italic>APOE</italic>
gene (rs7412 and rs429358) [
<xref ref-type="bibr" rid="B30">30</xref>
]. Modulator variants are of particular interest in pharmacogenomics since they are known to modulate the metabolism and effect of drugs [
<xref ref-type="bibr" rid="B31">31</xref>
].</p>
<p>The next section, on results, presents an overview of the ontology-guided data selection method. Three scenarios of data selection are then described to illustrate the proposal and its advantages.</p>
</sec>
<sec>
<title>Results</title>
<sec>
<title>Overview</title>
<p>An overview of the method is given in Figure
<xref ref-type="fig" rid="F1">1</xref>
. Data relevant to the study are collected from various resources such as genomic variation databases, published pharmacogenomic studies and private datasets. Various operations are applied to these data: cleaning, integration and transformation. These operations aim first at contributing to the instantiation of an existing KB, and second at producing the “initial dataset”. In this study, a dataset is defined as a relation between a set of objects (rows) and a set of attributes (columns). A mapping is then built between the objects and attributes of this dataset and the instances of the KB. Data selection results from the definition of a subset of instances in the KB, allowing the selection of the corresponding objects and attributes with respect to the mapping. This process, which takes as inputs the initial dataset and the KB, is controlled by the domain expert and yields the “reduced dataset”. Characteristics of the ontology such as subsumption relationships, properties and class descriptions are used to guide the choice of meaningful instance subsets. These subsets are in turn used for data selection. Data mining algorithms are then applied to the reduced dataset. In the three examples presented hereafter, two mining algorithms are used. The first algorithm is Zart, which extracts Frequent Itemsets (FI) and Frequent Closed Itemsets (FCI). The latter are special itemsets that cannot be extended in the dataset without lowering their support (see the Methods section). The ratio FI / FCI increases with the redundancy level of the itemsets. The second algorithm is COBWEB, which clusters the data in an unsupervised way. Here, the results of the clustering are simply characterized by the number of clusters obtained.</p>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption>
<p>
<bold>Overview of the proposed method</bold>
The KDD process is divided into three main steps: data preparation, data mining, and data interpretation. The figure details data preparation within the KDD process and illustrates our method of data selection guided by domain knowledge. Data relevant to the study are collected from various resources such as genomic variation databases, published pharmacogenomic studies and private datasets. Various operations are applied to these data: cleaning, integration and transformation. These operations imply first the instantiation of a knowledge base (1), and second the design of the “initial dataset” (2). In this study, a dataset is defined as a relation between a set of objects (rows) and a set of attributes (columns). A mapping is then built between the objects and attributes of this dataset and the instances of the KB (3). Data selection results from the definition of a subset of instances in the KB (4), allowing the selection of the corresponding objects and attributes with respect to the mapping. This process, which takes as inputs the initial dataset and the KB, is controlled by the domain expert and yields the “reduced dataset”. Characteristics of the ontology such as subsumption relationships, properties and class descriptions are used to guide the definition of meaningful instance subsets. These subsets are in turn used for data selection. Data mining algorithms are then applied to the reduced dataset. The results of the mining operation are interpreted in terms of knowledge units that can eventually be integrated into the knowledge base.</p>
</caption>
<graphic xlink:href="1471-2105-9-S4-S3-1"></graphic>
</fig>
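<p>The selection step described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: every object and attribute is mapped to exactly one KB instance, and all identifiers (patients, variants, <italic>kb:</italic> names, the <italic>select</italic> function) are hypothetical rather than taken from SNP-KB.</p>

```python
# Initial dataset: a relation between objects (rows) and attributes (columns).
# All identifiers below are hypothetical.
initial_dataset = {
    "patient_1": {"variant_x": 1, "variant_y": 0, "ldl_level": "high"},
    "patient_2": {"variant_x": 0, "variant_y": 1, "ldl_level": "normal"},
}

# Mapping from dataset attributes and objects to instances of the KB
# (simplified here to one instance per object or attribute).
attribute_to_instance = {
    "variant_x": "kb:var_x",
    "variant_y": "kb:var_y",
    "ldl_level": "kb:ldl",
}
object_to_instance = {"patient_1": "kb:p1", "patient_2": "kb:p2"}


def select(dataset, selected_instances):
    """Keep only the objects and attributes whose mapped KB instance is selected."""
    kept_attributes = [
        attr for attr, inst in attribute_to_instance.items() if inst in selected_instances
    ]
    return {
        obj: {attr: row[attr] for attr in kept_attributes}
        for obj, row in dataset.items()
        if object_to_instance[obj] in selected_instances
    }


# The expert selects a subset of KB instances; variant_y is left out.
reduced_dataset = select(initial_dataset, {"kb:p1", "kb:p2", "kb:var_x", "kb:ldl"})
print(reduced_dataset)
```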
</sec>
<sec>
<title>Articulation between data and knowledge</title>
<p>Our method is based on a mapping between objects and attributes of the dataset, and instances of the KB. Thus, formalized knowledge within the KB can be used for guiding data selection. Figure
<xref ref-type="fig" rid="F2">2</xref>
illustrates this mapping in the case of genomic variants assigned to concepts of the SNP-KB such as
<italic>conserved_domain_variant</italic>
,
<italic>coding_variant</italic>
,
<italic>non_coding_variant</italic>
,
<italic>haplotype_member</italic>
or
<italic>tag_snp</italic>
.</p>
<fig position="float" id="F2">
<label>Figure 2</label>
<caption>
<p>
<bold>Articulation between data and knowledge</bold>
Some classes of SNP- and SO-Pharm ontologies are shown as well as their assigned instances. The mapping between objects and attributes of the FH dataset, and instances of the KB is schematized.</p>
</caption>
<graphic xlink:href="1471-2105-9-S4-S3-2"></graphic>
</fig>
<p>The efficiency of the interaction between data and knowledge depends mainly on the instantiation of the KB with the collected data. This process depends on data integration issues and has to be controlled by the domain expert, who has to choose the most accurate class for the data considered. In this way, the domain expert is in charge of instantiating the right classes in the knowledge base. In practice, information about the mapping is stored in the KB during the instantiation process by adding a property to the created instance. Note that, depending on modelling choices, one object or one attribute can be mapped to more than one instance. Three concrete scenarios for data selection are now described.</p>
</sec>
<sec>
<title>Progressive selection of specific variants – guided by subsumption</title>
<p>The first scenario assumes that significant relationships between genotypes and phenotypes can be easily extracted from a reduced dataset, in which only coding variants or variants of conserved protein domains are considered. In our method, this kind of reduction results from the selection in the SNP-KB of a subset of instances corresponding to most specific and adequate classes in the ontology, with respect to subsumption relationships. As illustrated in Table
<xref ref-type="table" rid="T1">1</xref>
, a progressive selection of the most specific variant instances, belonging successively to the
<italic>variant</italic>
class and then to the
<italic>coding_variant</italic>
and
<italic>conserved_domain_variant</italic>
subclasses, leads to a decreasing number of variant-related attributes in the dataset: 289, 231, and 126 attributes respectively. In practice, the guiding of instance selection is managed through a plug-in of Protégé 4 adapted for this purpose (see the Methods section).</p>
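<p>Selection guided by subsumption can be sketched as follows. The class hierarchy below is an assumption made for illustration (in particular, placing the conserved domain class under the coding class), the instance names are hypothetical, and the sketch does not reproduce the Protégé plug-in.</p>

```python
# Hypothetical fragment of a class hierarchy: subclass -> direct superclass.
superclass_of = {
    "coding_variant": "variant",
    "non_coding_variant": "variant",
    "conserved_domain_variant": "coding_variant",
}

# Hypothetical instances asserted for each class.
instances_of = {
    "variant": {"v1"},
    "coding_variant": {"v2"},
    "non_coding_variant": {"v3"},
    "conserved_domain_variant": {"v4"},
}


def is_subsumed_by(cls, ancestor):
    """True if cls is the ancestor itself or one of its direct or indirect subclasses."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = superclass_of.get(cls)
    return False


def instances_under(ancestor):
    """Collect the instances of every class subsumed by the given class."""
    selected = set()
    for cls, members in instances_of.items():
        if is_subsumed_by(cls, ancestor):
            selected |= members
    return selected


# Moving down the hierarchy yields progressively smaller instance subsets.
for cls in ("variant", "coding_variant", "conserved_domain_variant"):
    print(cls, sorted(instances_under(cls)))
```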
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Quantitative characterization of data mining results depending on attribute selection. </p>
<p>Table 1 gives quantitative information about the output (number of itemsets and number of clusters) of the two data mining methods involved in this experiment. Each column corresponds to a different selection of attributes in the FH dataset.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td></td>
<td>
<italic>variant</italic>
</td>
<td>
<italic>coding_variant</italic>
</td>
<td>
<italic>conserved_domain_variant</italic>
</td>
<td>
<italic>tag_snp</italic>
</td>
</tr>
</thead>
<tbody>
<tr>
<td>Number of Variants</td>
<td>289</td>
<td>231</td>
<td>126</td>
<td>198</td>
</tr>
<tr>
<td colspan="5">
<hr></hr>
</td>
</tr>
<tr>
<td>FI (FCI) {ratio FI/FCI}</td>
<td>6928 (255) {27.17}</td>
<td>314 (24) {13.08}</td>
<td>304 (12) {25.33}</td>
<td>300 (28) {10.71}</td>
</tr>
<tr>
<td>Clusters</td>
<td>194</td>
<td>186</td>
<td>56</td>
<td>40</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Table
<xref ref-type="table" rid="T1">1</xref>
shows in addition the amount of data mining results obtained when most specific classes of variants are selected. When all variants are considered (
<italic>variant</italic>
column), the total number of FI computed by Zart is 6928. With COBWEB, the total number of clusters is 194. These results are difficult to interpret because of the large number of variants involved and the lack of contextual data. For example, coding and non-coding variants cannot be distinguished.</p>
<p>The volume of data mining results progressively decreases as more reduced sets of variants are selected (
<italic>coding</italic>
_
<italic>variant</italic>
and
<italic>conserved</italic>
_
<italic>domain</italic>
_
<italic>variant</italic>
columns). This reduction is visible in the number of FI (from 6928 to 304) and in the number of clusters (from 194 to 56), which makes the results easier to interpret.</p>
<p>Being able to use subsumption relationships between ontology classes for guiding data selection is one main advantage resulting from the knowledge formalization effort, data integration and data cleaning preceding the SNP-KB instantiation.</p>
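<p>As an illustration only, the subsumption-guided selection of this scenario can be sketched in a few lines of Python. The class hierarchy, the instance assignments and the helper names (SUBCLASS_OF, INSTANCE_OF, is_subsumed_by, select_attributes) are hypothetical toy values, and the nesting of conserved_domain_variant under coding_variant is an assumption of the sketch. None of it is taken from the SNP-KB or from the Protégé plug-in. The sketch only shows how choosing a more specific class shrinks the set of retained attributes.</p>
<preformat>
```python
# Toy subsumption hierarchy (hypothetical): child class -> parent class.
SUBCLASS_OF = {
    "coding_variant": "variant",
    "conserved_domain_variant": "coding_variant",
}

# Hypothetical instances, each asserted in its most specific class.
INSTANCE_OF = {
    "rs_001": "variant",
    "rs_002": "coding_variant",
    "rs_003": "conserved_domain_variant",
    "rs_004": "conserved_domain_variant",
}


def is_subsumed_by(cls, ancestor):
    """Return True if cls equals ancestor or is one of its descendants."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)  # climb one level; None at the root
    return False


def select_attributes(target_class):
    """Return the sorted instance names that belong to target_class."""
    return sorted(
        name for name, cls in INSTANCE_OF.items()
        if is_subsumed_by(cls, target_class)
    )


if __name__ == "__main__":
    for target in ("variant", "coding_variant", "conserved_domain_variant"):
        print(target, select_attributes(target))
```
</preformat>
<p>With these toy values, the selection keeps four, three and two attributes respectively, mirroring on a small scale the progressive reduction reported in Table 1.</p>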
</sec>
<sec>
<title>Tag-SNP based variant unification – guided by object properties</title>
<p>The examination of the data mining results obtained with the complete variant dataset reveals a high proportion of trivial and redundant association rules. This reflects the existence of variants belonging to the same haplotype. In simple words, a haplotype designates a group of variants that segregate uniformly and can be replaced by a smaller group of variants, called “tag-SNPs”. Replacing all members of a haplotype by the corresponding tag-SNP(s) may lower the number of extracted redundant association rules.</p>
<p>Figure
<xref ref-type="fig" rid="F3">3</xref>
shows a haplotype composed of variants
<italic>rs</italic>
_
<italic>004</italic>
,
<italic>rs</italic>
_
<italic>005</italic>
,
<italic>rs</italic>
_
<italic>006</italic>
and
<italic>rs</italic>
_
<italic>007</italic>
, that can be replaced by the unique
<italic>rs</italic>
_
<italic>007</italic>
tag-SNP. This information, which depends on the description of a given haplotype (NA01234), highlights a functional dependency between variant
<italic>rs</italic>
_
<italic>004</italic>
(or
<italic>rs</italic>
_
<italic>005</italic>
or
<italic>rs</italic>
_
<italic>006</italic>
) and
<italic>rs</italic>
_
<italic>007</italic>
. Such a functional dependency can be expressed in the SNP-knowledge base as follows.</p>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption>
<p>
<bold>Tag-SNP variant unification</bold>
. This figure focuses on some classes and instances from Figure 2. It develops the description of
<italic>Haplotype</italic>
and the
<italic>isHaplotypeMemberOf</italic>
and
<italic>isTaggedBy</italic>
object properties used for illustrating functional dependencies between instances of
<italic>variants</italic>
and
<italic>tag</italic>
_
<italic>snp</italic>
.</p>
</caption>
<graphic xlink:href="1471-2105-9-S4-S3-3"></graphic>
</fig>
<p>
<italic>rs</italic>
_
<italic>004</italic>
:=
<bold>
<italic>isHaplotypeMemberOf</italic>
</bold>
<italic>(haplotype</italic>
_
<italic>NA01234)</italic>
</p>
<p>
<italic>rs</italic>
_
<italic>004</italic>
:=
<bold>
<italic>isHaplotypeMemberOf</italic>
</bold>
<italic>(
<bold>isTaggedBy</bold>
(rs</italic>
_
<italic>007))</italic>
</p>
<p>A knowledge base may include information about functional dependencies taking the form of object properties (or sequences of object properties). Since the SNP-KB includes haplotype descriptions issued from the HapMap project [
<xref ref-type="bibr" rid="B32">32</xref>
] and Haploview software [
<xref ref-type="bibr" rid="B33">33</xref>
], and includes
<italic>isHaplotypeMemberOf</italic>
and
<italic>isTaggedBy</italic>
properties, it is possible to distinguish between tag-SNPs and other haplotype members in the SNP-KB. According to our method, reducing the dataset to tag-SNPs is based on the selection of a subset of variant instances of the
<italic>tag_snp</italic>
class. In the situation depicted in Figure
<xref ref-type="fig" rid="F3">3</xref>
, this implies in turn the removal of columns
<italic>rs</italic>
_
<italic>004</italic>
,
<italic>rs</italic>
_
<italic>005</italic>
, and
<italic>rs</italic>
_
<italic>006</italic>
in the dataset.</p>
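<p>The column removal described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two dictionaries stand in for the isHaplotypeMemberOf and isTaggedBy assertions of Figure 3, the function name reduce_to_tag_snps and the variant rs_001 are hypothetical, and keeping variants that belong to no described haplotype is an assumption of the sketch.</p>
<preformat>
```python
# Hypothetical property assertions mirroring the situation of Figure 3.
IS_HAPLOTYPE_MEMBER_OF = {
    "rs_004": "haplotype_NA01234",
    "rs_005": "haplotype_NA01234",
    "rs_006": "haplotype_NA01234",
    "rs_007": "haplotype_NA01234",
}
# haplotype -> list of its tag-SNPs
IS_TAGGED_BY = {"haplotype_NA01234": ["rs_007"]}


def reduce_to_tag_snps(columns):
    """Keep tag-SNPs and variants outside any haplotype, preserving order.

    Assumption of this sketch: a variant with no haplotype description
    is kept as it is.
    """
    kept = []
    for variant in columns:
        haplotype = IS_HAPLOTYPE_MEMBER_OF.get(variant)
        if haplotype is None or variant in IS_TAGGED_BY.get(haplotype, []):
            kept.append(variant)
    return kept


if __name__ == "__main__":
    print(reduce_to_tag_snps(["rs_001", "rs_004", "rs_005", "rs_006", "rs_007"]))
```
</preformat>
<p>On this toy input, the columns rs_004, rs_005 and rs_006 are dropped, and rs_007 remains as the representative of the haplotype.</p>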
<p>Applied to the FH initial dataset, this strategy considerably reduces the number of attributes (see Table
<xref ref-type="table" rid="T1">1</xref>
, compare the
<italic>variant</italic>
and
<italic>tag</italic>
_
<italic>snp</italic>
columns). The volume of extracted units to be interpreted is thus also considerably reduced, not only because of the lower number of attributes but also because of the reduced number of dependencies between selected attributes (see the percentage of non-redundant rules). One main advantage of guiding this selection process with a domain ontology is that the representation of functional dependencies between simple haplotype members and representative tag-SNPs in the SNP-KB can be used dynamically. This representation depends on the precision of haplotype construction and may evolve. Automated updating of haplotype representation and instantiation in the SNP-KB is under study.</p>
</sec>
<sec>
<title>Patient selection – guided by class definition and classification</title>
<p>In contrast with the two previous scenarios dedicated to attribute selection, e.g.
<italic>variant</italic>
, this paragraph illustrates object selection, e.g.
<italic>patient</italic>
selection, leading to a reduction of the dataset as well. This third scenario illustrates the selection of instances based on the description of classes within the SO-Pharm ontology. SO-Pharm encompasses and extends SNP-Ontology (see Methods section).</p>
<p>In the FH case study, groups of patients suspected to present specific genotype-phenotype profiles are defined. The classes and properties of SO-Pharm make it possible to define four classes of patients: one that already exists in SO-Pharm, and three others that are defined for the data selection.</p>
<p>
<italic>patient</italic>
(defined in SO-Pharm)</p>
<p>
<italic>patient</italic>
_α ≡
<italic>patient</italic>
⊓ ∃
<italic>presentsGenotypeItem (</italic>
<italic>(LDLR</italic>
_
<italic>mutation))</italic>
</p>
<p>
<italic>patient</italic>
_β ≡
<italic>patient</italic>
⊓ ∃
<italic>presentsGenotypeItem (</italic>
<italic>(no</italic>
_
<italic>LDLR</italic>
_
<italic>mutation))</italic>
</p>
<p>⊓ ∃
<italic>presentsPhenotypeItem (</italic>
<italic>(high</italic>
_
<italic>LDL</italic>
_
<italic>in</italic>
_
<italic>blood))</italic>
</p>
<p>
<italic>patient</italic>
_γ ≡
<italic>patient</italic>
⊓ ∃
<italic>presentsGenotypeItem (</italic>
<italic>(no</italic>
_
<italic>LDLR</italic>
_
<italic>mutation))</italic>
</p>
<p>⊓ ∃
<italic>presentsPhenotypeItem (</italic>
<italic>(normal</italic>
_
<italic>LDL</italic>
_
<italic>in</italic>
_
<italic>blood))</italic>
</p>
<p>Reasoning mechanisms applied to instances classify patients according to their individual properties. This makes it possible to detect and select a set of objects sharing the same attributes, as a set of instances belonging to the same class. This selection may reduce the volume of data input for subsequent mining tasks, and allows the characterization and comparison of the selected subgroups.</p>
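<p>The effect of such a classification can be approximated by the following sketch. It is only a closed-world imitation using plain set membership, not a Description Logics reasoner such as Pellet. The names patient_alpha, patient_beta and patient_gamma are ASCII transliterations of the three defined classes, and the patients p1 to p3, together with the functions classify_patient and select_patients, are hypothetical.</p>
<preformat>
```python
# Hypothetical patients: identifier -> (genotype items, phenotype items).
PATIENTS = {
    "p1": ({"LDLR_mutation"}, {"high_LDL_in_blood"}),
    "p2": ({"no_LDLR_mutation"}, {"high_LDL_in_blood"}),
    "p3": ({"no_LDLR_mutation"}, {"normal_LDL_in_blood"}),
}


def classify_patient(genotype_items, phenotype_items):
    """Return the set of class names whose definition the patient satisfies."""
    classes = {"patient"}
    if "LDLR_mutation" in genotype_items:
        classes.add("patient_alpha")
    if "no_LDLR_mutation" in genotype_items:
        if "high_LDL_in_blood" in phenotype_items:
            classes.add("patient_beta")
        if "normal_LDL_in_blood" in phenotype_items:
            classes.add("patient_gamma")
    return classes


def select_patients(target_class):
    """Return the sorted identifiers of patients classified in target_class."""
    return sorted(
        pid for pid, (genotype, phenotype) in PATIENTS.items()
        if target_class in classify_patient(genotype, phenotype)
    )


if __name__ == "__main__":
    for name in ("patient", "patient_alpha", "patient_beta", "patient_gamma"):
        print(name, select_patients(name))
```
</preformat>
<p>Selecting a defined class then amounts to keeping only the corresponding rows of the dataset, whereas the two previous scenarios act on its columns.</p>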
</sec>
</sec>
<sec>
<title>Discussion</title>
<p>Data selection is a crucial step in the KDD process, and any attention paid to selection makes the KDD process more efficient. Indeed, the computational cost in space and time of data mining algorithms is exponential (at worst), and any reduction of the initial dataset affects the whole data mining process. In addition, the practical use of data mining algorithms is often limited by the size of datasets or by machine capabilities. For example, the extraction of frequent itemsets from the FH dataset on a standard workstation with a Pentium 1.8 GHz and 2Mb of RAM has to be limited to the calculation of the “most frequent” itemsets, since the minimum support has to be set very high (i.e. 96%). Data selection is thus an important operation of the preparation step of KDD, allowing the data mining algorithm to handle large datasets. Comparative tests show that data selection almost always reduces the volume of results and, in some cases, the redundancy within the extracted units. The efficiency of data selection is not so surprising and demonstrates, to a certain extent, some advantages of using an ontology. More importantly, positive feedback has been observed from the domain expert, who enthusiastically piloted the data selection with the assistance of an ontology. The smaller size of the results has been a second cause of satisfaction for the domain expert, since the data mining tests have revealed non-standard results that may be of interest with respect to the domain knowledge.</p>
<p>Ontology-guided data selection can be performed by taking advantage of subsumption relationships between ontology classes and by defining subsets of instances corresponding to the most specific classes. When association rules have been extracted from a reduced dataset, the subsumption relationships can be followed within the ontology, for generalizing the association rules. This bottom-up traversal of the ontology can be used, for example, to check whether an extracted association rule between a coding variant and a phenotypic trait can be extended to some non-coding variants. This kind of association may be observed when intron splice sites are affected as discussed in [
<xref ref-type="bibr" rid="B34">34</xref>
].</p>
</sec>
<sec>
<title>Conclusions</title>
<p>This paper illustrates how domain knowledge captured in bio-ontologies facilitates the KDD process. An approach for data selection has been proposed that takes advantage of the time and effort spent on the KB construction.</p>
<p>The three proposed scenarios of data selection can be combined in order to define optimized KDD strategies fulfilling biomedical objectives. For that purpose, additional scenarios can be planned, such as object unification, i.e. grouping together patients from the same family and retaining a unique representative for the family, thus reducing the number of objects to be manipulated. The selection process depends on instance properties (object and data properties), and accordingly on data and instantiation quality. When an instance is missing or faulty, the selection will be erroneous or impossible. In this respect, the available knowledge on haplotypes could also be used for completing missing values about the observed alleles of each member of a haplotype.</p>
<p>Challenging future work consists in automatically formalizing the results of the KDD process within a knowledge representation language, for enriching both the ontology and the KB. Such a capability would make it possible to run the KDD process iteratively, using more complete domain knowledge after each KDD iteration.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<sec>
<title>The FH dataset</title>
<p>Objects in the FH dataset are patients of a clinical study related to Familial Hypercholesterolemia. Attributes are data relative to the phenotype or the genotype of the patients.</p>
<p>The dataset concerns:</p>
<p>(α) patients affected by the genetic hypercholesterolemia (FH),</p>
<p>(β) patients affected by a non-genetic hypercholesterolemia, and</p>
<p>(γ) patients without any hypercholesterolemia.</p>
<p>The majority of genotype attributes (289/292) describe observed alleles for genomic variants of the
<italic>LDLR</italic>
gene. An example of genotype attribute is the observed allele for the variant located at position Chr19:11085058 (e.g. AA). Phenotype attributes describe traits usually observed when studying the metabolism of lipids. Two examples of phenotype attributes are the LDL blood concentration (e.g. [LDL]
<sub>b</sub>
=3gl
<sup>−1</sup>
) and the presence/absence of xanthoma. Table
<xref ref-type="table" rid="T2">2</xref>
describes quantitatively the dataset.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Characteristics of the FH dataset. </p>
<p>The FH dataset results from a clinical study relative to Familial Hypercholesterolemia. Its size and composition are described in Table 2. Phenotype refers to phenotypic attributes including for instance LDL concentration in blood. Genotype attributes include 289 genomic variations of the
<italic>LDLR</italic>
gene and 3 attributes relative to the presence of mutations in 3 other genes.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td>Objects</td>
<td colspan="2">
<italic>Patients</italic>
</td>
<td>125</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">Attributes</td>
<td>
<italic>Phenotype</italic>
</td>
<td>12</td>
<td rowspan="3">304</td>
</tr>
<tr>
<td colspan="2">
<hr></hr>
</td>
</tr>
<tr>
<td>
<italic>Genotype</italic>
</td>
<td>292</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>SNP-Ontology and SO-Pharm</title>
<p>The SNP-Ontology [
<xref ref-type="bibr" rid="B35">35</xref>
] includes a formal representation in OWL-DL (a sublanguage of the Web Ontology Language) of genomic variations and their related concepts: sequence in which they are observed, haplotype they belong to, proteins they modify, database in which they are stored, etc. For this study, a SNP-Knowledge Base (SNP-KB) is populated according to the semantic structure of the SNP-Ontology, integrating knowledge about genomic variations of the
<italic>LDLR</italic>
gene (Figure
<xref ref-type="fig" rid="F2">2</xref>
). Partially integrated data sources are dbSNP, HapMap, OMIM and private or public Locus Specific Databases [
<xref ref-type="bibr" rid="B36">36</xref>
]. The method used to populate the SNP-KB is described precisely in [
<xref ref-type="bibr" rid="B19">19</xref>
].</p>
<p>SO-Pharm is an OWL-DL ontology embedding knowledge about clinical studies in pharmacogenomics [
<xref ref-type="bibr" rid="B37">37</xref>
,
<xref ref-type="bibr" rid="B38">38</xref>
]. SO-Pharm satisfies all quality principles defined by the OBO Foundry [
<xref ref-type="bibr" rid="B39">39</xref>
]. It is closely articulated with the SNP-Ontology as with other ontologies that include knowledge about other pharmacogenomics sub-domains, i.e. related to drug, genotype, and phenotype. SO-Pharm and articulated ontologies are used to guide the data selection process.</p>
</sec>
<sec>
<title>Knowledge management and instance selection tools</title>
<p>Instantiation of classes in the ontologies is managed both with Protégé [
<xref ref-type="bibr" rid="B40">40</xref>
] and Jena API [
<xref ref-type="bibr" rid="B41">41</xref>
]. Consistency checking and classification are carried out with Pellet 1.4 [
<xref ref-type="bibr" rid="B42">42</xref>
]. Practically, the instance selection is performed through an adapted Protégé 4 plug-in [
<xref ref-type="bibr" rid="B43">43</xref>
]. This plug-in allows the selection of instances sharing characteristics, e.g. class membership, properties, relation with another specific instance, (a) by browsing and selecting items in hierarchies of classes, object properties and lists of instances in a KB, or (b) by answering DL queries with complex restrictions. This plug-in is currently under development and is planned to be released to the scientific community in the near future.</p>
</sec>
<sec>
<title>Data mining methods</title>
<p>Data mining tests have been run on the FH dataset with two different unsupervised algorithms. The first one, named Zart, extracts association rules after searching for frequent itemsets [
<xref ref-type="bibr" rid="B44">44</xref>
,
<xref ref-type="bibr" rid="B45">45</xref>
]. Zart generates itemsets of the form “
<italic>ABC</italic>
”, from which association rules such as “
<italic>AB implies C</italic>
” can in turn be derived. An itemset is characterized by its support, i.e. the frequency of its occurrence in the dataset. Frequent Itemsets (FI) are itemsets with a support greater than a minimum threshold, or minimum support, which has to be fixed by the domain expert. Frequent Closed Itemsets (FCI) are FI that are not included in any superset, i.e. a larger itemset, with the same support. Zart has been parameterized with a minimum support of 96% for the experiment. The principal motivation for using Zart is that this algorithm generates FI, FCI and, in addition, the so-called minimal generators, from which the set of minimal non-redundant association rules can be inferred. COBWEB is the second algorithm; it builds a structural clustering [
<xref ref-type="bibr" rid="B46">46</xref>
]. COBWEB is parameterized with an acuity=1 and a cutoff=0.5 that affect the construction of clusters with constraints on their relation and their cardinality. COBWEB is an algorithm of interest in the present study, because it generates a cluster hierarchy that can be reused in parallel with FI and FCI (the use of these clusters is planned in a future work).</p>
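<p>The definitions of support, FI and FCI given above can be made concrete with a small brute-force sketch. It is not the Zart algorithm and does not use the Coron platform: it simply enumerates all itemsets of a toy dataset. The function names, the four toy transactions and the use of a "greater than or equal to" comparison against the minimum support are assumptions of the sketch.</p>
<preformat>
```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def closed_itemsets(frequent):
    """Keep the FI that have no strict superset with the same support."""
    closed = {}
    for itemset, s in frequent.items():
        has_equal_superset = any(
            itemset != other and itemset.issubset(other) and s == s_other
            for other, s_other in frequent.items()
        )
        if not has_equal_superset:
            closed[itemset] = s
    return closed


if __name__ == "__main__":
    toy = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"A"}]
    fi = frequent_itemsets(toy, 0.5)
    fci = closed_itemsets(fi)
    print(len(fi), len(fci), len(fi) / len(fci))
```
</preformat>
<p>On this toy dataset with a minimum support of 50%, the sketch finds 7 FI and 3 FCI (A, AB and ABC). The FI/FCI ratio reported in Table 1 is computed in the same way from the Zart output.</p>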
<p>The implementations of Zart and COBWEB mentioned above are available respectively in the Coron platform [
<xref ref-type="bibr" rid="B47">47</xref>
] and the Weka toolbox [
<xref ref-type="bibr" rid="B48">48</xref>
].</p>
</sec>
</sec>
<sec>
<title>List of abbreviations used</title>
<p>API – Application Programming Interface</p>
<p>dbSNP – Single Nucleotide Polymorphism database</p>
<p>DL – Description Logics</p>
<p>FCI – Frequent Closed Itemset</p>
<p>FH – Familial Hypercholesterolemia</p>
<p>FI – Frequent Itemset</p>
<p>KB – Knowledge Base</p>
<p>KDD – Knowledge Discovery in Databases</p>
<p>LDL – Low-Density Lipoprotein</p>
<p>
<italic>LDLR</italic>
– Low-Density Lipoprotein Receptor</p>
<p>NCBO – National Center for Biomedical Ontology</p>
<p>OBO – Open Biomedical Ontologies</p>
<p>OLAP – Online Analytical Processing</p>
<p>OMIM – Online Mendelian Inheritance in Man</p>
<p>OWL – Web Ontology Language</p>
<p>RAM – Random Access Memory</p>
<p>SNP – Single Nucleotide Polymorphism</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>AN proposed the initial idea of using domain knowledge in the KDD process. AC, MS and MDD designed the method. AC implemented the framework and performed the tests. PB carried out the FH clinical study and analysed the data selection and data mining results. AC, MS, AN and MDD contributed to writing the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</body>
<back>
<ack>
<sec>
<title>Acknowledgements</title>
<p>This work has been partly funded by a European EUREKA-labelled research and development project, and the PRST “Intelligence Logicielle” (a Région Lorraine research project). The authors would like to thank the members of ISIBio working group and the participants of the Semantic Web summer school SWWW'06 for stimulating interactions.</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics</italic>
Volume 9 Supplement 4, 2008: A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/9?issue=S4"></ext-link>
.</p>
</sec>
</ack>
<ref-list>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Frawley</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Piatetsky-Shapiro</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Matheus</surname>
<given-names>C</given-names>
</name>
</person-group>
<person-group person-group-type="editor">
<name>
<surname>Piatetsky-Shapiro G, Frawley WJ</surname>
</name>
</person-group>
<article-title>Knowledge Discovery in Databases: An Overview</article-title>
<source>Knowledge Discovery in Databases</source>
<year>1991</year>
<publisher-name>Cambridge: AAAI/MIT Press</publisher-name>
<fpage>1</fpage>
<lpage>30</lpage>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gruber</surname>
<given-names>TR</given-names>
</name>
</person-group>
<article-title>A Translation Approach to Portable Ontology Specifications</article-title>
<source>Knowledge Acquisition</source>
<year>1993</year>
<volume>5</volume>
<fpage>199</fpage>
<lpage>220</lpage>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bodenreider</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Stevens</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Bio-ontologies: current trends and future directions</article-title>
<source>Briefings in Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>256</fpage>
<lpage>274</lpage>
<pub-id pub-id-type="pmid">16899495</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="other">
<article-title>Bioportal</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.bioontology.org/tools/portal/bioportal.html"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rubin</surname>
<given-names>DL</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>SE</given-names>
</name>
<name>
<surname>Mungall</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Misra</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Westerfield</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sim</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Chute</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Solbrig</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Storey</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Day-Richter</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Noy</surname>
<given-names>NF</given-names>
</name>
<name>
<surname>Musen</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>National Center for Biomedical Ontology: Advancing Biomedicine through Structured Organization of Scientific Knowledge</article-title>
<source>OMICS</source>
<year>2006</year>
<volume>10</volume>
<fpage>185</fpage>
<lpage>198</lpage>
<pub-id pub-id-type="pmid">16901225</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Anand</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bell</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hughes</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>The Role of Domain Knowledge in Data Mining</article-title>
<source>Proceedings of the Conference on Information and Knowledge Management: 29 November – 02 December 1995; Baltimore</source>
<year>1995</year>
<publisher-name>New-York: ACM</publisher-name>
<fpage>37</fpage>
<lpage>43</lpage>
</citation>
</ref>
<ref id="B7">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Cespivova</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Rauch</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Svatek</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Kejkula</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Tomeckova</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Roles of Medical Ontology in Association Mining CRISP-DM Cycle</article-title>
<source>Proceedings of the ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies: 24 September 2004; Pisa</source>
<year>2004</year>
</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gottgtroy</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kasabov</surname>
<given-names>N</given-names>
</name>
<name>
<surname>MacDonell</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>An ontology driven approach for knowledge discovery in biomedicine</article-title>
<source>Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence: 9-13 August 2004; Auckland</source>
<year>2004</year>
<publisher-name>Berlin: Springer</publisher-name>
</citation>
</ref>
<ref id="B9">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Napoli</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Elements on KDDK: Knowledge Discovery guided by Domain Knowledge</article-title>
<source>Proceedings of the Conference on Concept Lattices and their Applications: 30 October – 1 November; Hammamet</source>
<year>2006</year>
</citation>
</ref>
<ref id="B10">
<citation citation-type="other">
<article-title>OWL Web Ontology Language Overview</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.w3.org/TR/owl-features/"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B11">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Karel</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Kléma</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Quantitative association rule mining in genomics using apriori knowledge</article-title>
<source>Proceedings of the ECML/PKDD07 Workshop Prior Conceptual Knowledge in Machine Learning and Data Mining: 21 September; Warsaw</source>
<year>2007</year>
<fpage>53</fpage>
<lpage>64</lpage>
</citation>
</ref>
<ref id="B12">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Nazeri</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Bloedorn</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Exploiting Available Domain Knowledge to Improve Mining Aviation Safety and Network Security Data</article-title>
<source>Proceedings of the ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies: 24 September 2004; Pisa</source>
<year>2004</year>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Analyzing the subjective interestingness of association rules</article-title>
<source>IEEE Intelligent Systems</source>
<year>2000</year>
<volume>15</volume>
<fpage>47</fpage>
<lpage>55</lpage>
</citation>
</ref>
<ref id="B14">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Srikant</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Agrawal</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Mining generalized association rules</article-title>
<source>Proceedings of the 21st Very Large Data Bases Conference 8-10 September 1995; Zurich</source>
<year>1995</year>
<fpage>407</fpage>
<lpage>419</lpage>
</citation>
</ref>
<ref id="B15">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Svatek</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Rauch</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Flek</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Ontology-Based Explanation of Discovered Associations in the Domain of Social Reality</article-title>
<source>Proceedings of the ECML/PKDD05 Workshop on Knowledge Discovery and Ontologies: 7 October 2005; Porto</source>
<year>2005</year>
</citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Han</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Kamber</surname>
<given-names>M</given-names>
</name>
</person-group>
<source>Data Mining: Concepts and Techniques</source>
<year>2000</year>
<publisher-name>San-Francisco: Morgan Kaufmann Publishers</publisher-name>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goble</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Stevens</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bechhofer</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Paton</surname>
<given-names>NW</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>PG</given-names>
</name>
<name>
<surname>Peim</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Brass</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Transparent Access to Multiple Bioinformatics Information Sources</article-title>
<source>IBM Systems Journal Special issue on deep computing for the life sciences</source>
<year>2001</year>
<volume>40</volume>
<fpage>532</fpage>
<lpage>551</lpage>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Köhler</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Philippi</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lange</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>SEMEDA: ontology based semantic integration of biological databases</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>2420</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="pmid">14668226</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Coulet</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Smaïl-Tabbone</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Benlian</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Napoli</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Devignes</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>SNP-Converter: An Ontology-Based Solution to Reconcile Heterogeneous SNP Descriptions</article-title>
<source>Proceedings of the Workshop on Data Integration in the Life Sciences 20-22 July 2006; Hinxton</source>
<year>2006</year>
<publisher-name>Berlin: Springer</publisher-name>
<fpage>82</fpage>
<lpage>93</lpage>
<comment>LNBI 4075</comment>
</citation>
</ref>
<ref id="B20">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Pérez-Rey</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Anguita</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Crespo</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data</article-title>
<source>Proceedings of the International Symposium on Medical Data Analysis 7-8 December; Thessaloniki</source>
<year>2006</year>
<publisher-name>Berlin: Springer</publisher-name>
<fpage>262</fpage>
<lpage>272</lpage>
<comment>LNBI 4345</comment>
</citation>
</ref>
<ref id="B21">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Euler</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Scholz</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Using Ontologies in a KDD Workbench</article-title>
<source>Proceedings of the ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies 24 September 2004; Pisa</source>
<year>2004</year>
</citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Agarwal</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Agrawal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Deshpande</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Naughton</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ramakrishnan</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Sarawagi</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>On the Computation of Multidimensional Aggregates</article-title>
<source>Proceedings of the Very Large Data Bases Conference 03 – 06 September 1996; Bombay</source>
<year>1996</year>
<publisher-name>San Francisco: Morgan Kaufmann Publishers Inc.</publisher-name>
<fpage>506</fpage>
<lpage>521</lpage>
</citation>
</ref>
<ref id="B23">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Han</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Databases</article-title>
<source>Proceedings of the AAAI Workshop on Knowledge Discovery in Databases 31 July – 4 August 1994; Seattle</source>
<year>1994</year>
<publisher-name>AAAI Press</publisher-name>
<fpage>157</fpage>
<lpage>168</lpage>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Han</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Feature selection based on rough set and information entropy</article-title>
<source>Proceedings of the IEEE International Conference on Granular Computing 25-27 July 2005; Beijing</source>
<year>2005</year>
<volume>1</volume>
<fpage>153</fpage>
<lpage>158</lpage>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kohavi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>John</surname>
<given-names>GH</given-names>
</name>
</person-group>
<article-title>Wrappers for feature subset selection</article-title>
<source>Artificial Intelligence</source>
<year>1997</year>
<volume>97</volume>
<fpage>273</fpage>
<lpage>324</lpage>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bernstein</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Provost</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hill</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Toward intelligent assistance for a data mining process: an ontology-based approach for cost-sensitive classification</article-title>
<source>IEEE Transactions on Knowledge and Data Engineering</source>
<year>2005</year>
<volume>17</volume>
<fpage>503</fpage>
<lpage>518</lpage>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Creighton</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hanash</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Mining gene expression databases for association rules</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<fpage>79</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="pmid">12499296</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Capriotti</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fariselli</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Calabrese</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Casadio</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Predicting Protein Stability Changes from Sequences Using Support Vector Machines</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>ii54</fpage>
<lpage>ii58</lpage>
<pub-id pub-id-type="pmid">16204125</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Elston</surname>
<given-names>RC</given-names>
</name>
</person-group>
<article-title>Haplotype-based Quantitative Trait Mapping Using a Clustering Algorithm</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>258</fpage>
<pub-id pub-id-type="pmid">16709248</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ng</surname>
<given-names>MCY</given-names>
</name>
<name>
<surname>Baum</surname>
<given-names>L</given-names>
</name>
<name>
<surname>So</surname>
<given-names>WY</given-names>
</name>
<name>
<surname>Lam</surname>
<given-names>VKL</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Poon</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Tomlinson</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lindpaintner</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>JCN</given-names>
</name>
</person-group>
<article-title>Association of lipoprotein lipase S447X, apolipoprotein E exon 4, and apoC3 -455T-C polymorphisms on the susceptibility to diabetic nephropathy</article-title>
<source>Clin Genet</source>
<year>2006</year>
<volume>70</volume>
<fpage>20</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="pmid">16813599</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Giacomini</surname>
<given-names>KM</given-names>
</name>
<name>
<surname>Brett</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>RB</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The pharmacogenetics research network: from SNP discovery to clinical drug response</article-title>
<source>Clin Pharmacol Ther</source>
<year>2007</year>
<volume>81</volume>
<fpage>328</fpage>
<lpage>345</lpage>
<pub-id pub-id-type="pmid">17339863</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="other">
<article-title>HapMap</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.hapmap.org/"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Barrett</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Fry</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Maller</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Daly</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Haploview: analysis and visualization of LD and haplotype maps</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>263</fpage>
<lpage>265</lpage>
<pub-id pub-id-type="pmid">15297300</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hastings</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Resta</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Traum</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Stella</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Guanti</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Krainer</surname>
<given-names>AR</given-names>
</name>
</person-group>
<article-title>An LKB1 AT-AC intron mutation causes Peutz-Jeghers syndrome via splicing at noncanonical cryptic splice sites</article-title>
<source>Nat Struct Mol Biol</source>
<year>2005</year>
<volume>12</volume>
<fpage>54</fpage>
<lpage>59</lpage>
<pub-id pub-id-type="pmid">15608654</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="other">
<article-title>SNP-Ontology</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.bioontology.org/files/6723/snpontology_full.owl"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B36">
<citation citation-type="other">
<article-title>WayStation</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.centralmutations.org/"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B37">
<citation citation-type="other">
<article-title>SO-Pharm</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.obofoundry.org/cgi-bin/detail.cgi?id=pharmacogenomics"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B38">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Coulet</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Smaïl-Tabbone</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Napoli</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Devignes</surname>
<given-names>MD</given-names>
</name>
</person-group>
<article-title>Suggested Ontology for Pharmacogenomics (SO-Pharm): Modular Construction and Preliminary Testing</article-title>
<source>Proceedings of the Workshop on Knowledge Systems in Bioinformatics 29 October 2006; Montpellier</source>
<year>2006</year>
<publisher-name>Berlin: Springer</publisher-name>
<fpage>648</fpage>
<lpage>657</lpage>
<comment>LNCS 4277</comment>
</citation>
</ref>
<ref id="B39">
<citation citation-type="other">
<article-title>Open Biomedical Ontologies (OBO) Foundry</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://obofoundry.org"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B40">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Knublauch</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Fergerson</surname>
<given-names>RW</given-names>
</name>
<name>
<surname>Noy</surname>
<given-names>NF</given-names>
</name>
<name>
<surname>Musen</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>The Protégé OWL Plugin: An Open Development Environment for Semantic Web Applications</article-title>
<source>Proceedings of the Third International Semantic Web Conference 7-11 November 2004; Hiroshima</source>
<year>2004</year>
<publisher-name>Berlin: Springer</publisher-name>
</citation>
</ref>
<ref id="B41">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>McBride</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Jena: Implementing the RDF Model and Syntax Specification</article-title>
<source>Proceedings of the WWW2001 Workshop on the Semantic Web 1 May 2001; Hong Kong</source>
<year>2001</year>
</citation>
</ref>
<ref id="B42">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Sirin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Parsia</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Pellet: An OWL DL Reasoner</article-title>
<source>Proceedings of the Workshop on Description Logics 6-8 June 2004; Whistler</source>
<year>2004</year>
</citation>
</ref>
<ref id="B43">
<citation citation-type="other">
<article-title>Protégé 4 alpha plugins</article-title>
<comment>[
<ext-link ext-link-type="uri" xlink:href="http://www.co-ode.org/downloads/protege-x/plugins.php"></ext-link>
]</comment>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Agrawal</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Imielinski</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Swami</surname>
<given-names>AN</given-names>
</name>
</person-group>
<article-title>Mining Association Rules between Sets of Items in Large Databases</article-title>
<source>SIGMOD</source>
<year>1993</year>
<volume>22</volume>
<fpage>207</fpage>
</citation>
</ref>
<ref id="B45">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Szathmary</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Napoli</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Kuznetsov</surname>
<given-names>SO</given-names>
</name>
</person-group>
<article-title>ZART: A Multifunctional Itemset Mining Algorithm</article-title>
<source>Proceedings of the 5th International Conference on Concept Lattices and Their Applications 24-26 October 2007; Montpellier</source>
<year>2007</year>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fisher</surname>
<given-names>DH</given-names>
</name>
</person-group>
<article-title>Knowledge Acquisition via Incremental Conceptual Clustering</article-title>
<source>Machine Learning</source>
<year>1987</year>
<volume>2</volume>
<fpage>139</fpage>
<lpage>172</lpage>
</citation>
</ref>
<ref id="B47">
<citation citation-type="other">
<person-group person-group-type="author">
<name>
<surname>Szathmary</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Napoli</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>CORON: A Framework for Levelwise Itemset Mining Algorithms</article-title>
<source>Supplementary Proceedings of the Third International Conference on Formal Concept Analysis 14-18 February 2005; Lens</source>
<year>2005</year>
<fpage>110</fpage>
<lpage>113</lpage>
</citation>
</ref>
<ref id="B48">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Witten</surname>
<given-names>IH</given-names>
</name>
<name>
<surname>Frank</surname>
<given-names>E</given-names>
</name>
</person-group>
<source>Data Mining: Practical machine learning tools and techniques</source>
<year>2005</year>
<publisher-name>San Francisco: Morgan Kaufmann Publishers</publisher-name>
</citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000022 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000022 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:2367630
   |texte=   Ontology-guided data preparation for discovering genotype-phenotype relationships
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:18460176" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a InforLorV4 


This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022