Exploration server on computer science research in Lorraine

IntelliGO: a new vector-based semantic similarity measure including annotation origin

Internal identifier: 000075 (Pmc/Corpus); previous: 000074; next: 000076

Authors: Sidahmed Benabderrahmane; Malika Smail-Tabbone; Olivier Poch; Amedeo Napoli; Marie-Dominique Devignes

RBID : PMC:3098105

Abstract

Background

The Gene Ontology (GO) is a well-known controlled vocabulary describing the biological process, molecular function and cellular component aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (i.e. their evidence codes).

Results

We present here a new semantic similarity measure called IntelliGO which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The IntelliGO similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO biological process and molecular function terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the IntelliGO similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the IntelliGO similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures.
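
The vector model summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the evidence-code weights, the information-content values, the term identifiers (GO:A, GO:B, GO:C) and the term-to-term relatedness table are all hypothetical placeholders, and the abstract does not give the exact formulas. The sketch only shows the general idea of a generalized cosine, in which the dot product also counts pairs of distinct but related terms.

```python
from math import sqrt

# Hypothetical illustration values; none of these numbers come from the paper.
EVIDENCE_WEIGHT = {"EXP": 1.0, "IEA": 0.5}               # weight per evidence code
INFO_CONTENT = {"GO:A": 2.0, "GO:B": 1.0, "GO:C": 3.0}   # specificity score per term
PAIR_SIM = {frozenset(("GO:A", "GO:B")): 0.5}            # assumed graph-derived relatedness


def term_sim(s, t):
    """Relatedness of two terms: 1 for identical terms, 0 if no entry is known."""
    if s == t:
        return 1.0
    return PAIR_SIM.get(frozenset((s, t)), 0.0)


def gene_vector(annotations):
    """Map (term, evidence_code) pairs to {term: weight * information content}."""
    vec = {}
    for term, code in annotations:
        coeff = EVIDENCE_WEIGHT[code] * INFO_CONTENT[term]
        # If a term is annotated several times, keep the largest coefficient.
        vec[term] = max(vec.get(term, 0.0), coeff)
    return vec


def generalized_dot(u, v):
    """Dot product over all term pairs, weighted by term relatedness."""
    return sum(a * b * term_sim(s, t) for s, a in u.items() for t, b in v.items())


def generalized_cosine(u, v):
    """Generalized dot product normalized by the generalized norms."""
    norm = sqrt(generalized_dot(u, u) * generalized_dot(v, v))
    return generalized_dot(u, v) / norm if norm else 0.0


g1 = gene_vector([("GO:A", "EXP")])
g2 = gene_vector([("GO:B", "IEA")])
g3 = gene_vector([("GO:C", "EXP")])
print(generalized_cosine(g1, g2))
print(generalized_cosine(g1, g3))
```

In this toy setting g1 and g2 share no term, so an ordinary cosine would score them as unrelated; the generalized form gives them a non-zero similarity because GO:A and GO:B are declared related, while g1 and g3 remain at zero.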

Conclusions

The IntelliGO similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering.

Availability

An on-line version of the IntelliGO similarity measure is available at: http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/


DOI: 10.1186/1471-2105-11-588
PubMed: 21122125
PubMed Central: 3098105

Links to Exploration step

PMC:3098105

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">IntelliGO: a new vector-based semantic similarity measure including annotation origin</title>
<author>
<name sortKey="Benabderrahmane, Sidahmed" sort="Benabderrahmane, Sidahmed" uniqKey="Benabderrahmane S" first="Sidahmed" last="Benabderrahmane">Sidahmed Benabderrahmane</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Smail Tabbone, Malika" sort="Smail Tabbone, Malika" uniqKey="Smail Tabbone M" first="Malika" last="Smail-Tabbone">Malika Smail-Tabbone</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Poch, Olivier" sort="Poch, Olivier" uniqKey="Poch O" first="Olivier" last="Poch">Olivier Poch</name>
<affiliation>
<nlm:aff id="I2">L.B.G.I., CNRS UMR7104, IGBMC, 1 rue Laurent Fries, 67404 Illkirch Strasbourg, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Napoli, Amedeo" sort="Napoli, Amedeo" uniqKey="Napoli A" first="Amedeo" last="Napoli">Amedeo Napoli</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Devignes, Marie Dominique" sort="Devignes, Marie Dominique" uniqKey="Devignes M" first="Marie-Dominique" last="Devignes">Marie-Dominique Devignes</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21122125</idno>
<idno type="pmc">3098105</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098105</idno>
<idno type="RBID">PMC:3098105</idno>
<idno type="doi">10.1186/1471-2105-11-588</idno>
<date when="2010">2010</date>
<idno type="wicri:Area/Pmc/Corpus">000075</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000075</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">IntelliGO: a new vector-based semantic similarity measure including annotation origin</title>
<author>
<name sortKey="Benabderrahmane, Sidahmed" sort="Benabderrahmane, Sidahmed" uniqKey="Benabderrahmane S" first="Sidahmed" last="Benabderrahmane">Sidahmed Benabderrahmane</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Smail Tabbone, Malika" sort="Smail Tabbone, Malika" uniqKey="Smail Tabbone M" first="Malika" last="Smail-Tabbone">Malika Smail-Tabbone</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Poch, Olivier" sort="Poch, Olivier" uniqKey="Poch O" first="Olivier" last="Poch">Olivier Poch</name>
<affiliation>
<nlm:aff id="I2">L.B.G.I., CNRS UMR7104, IGBMC, 1 rue Laurent Fries, 67404 Illkirch Strasbourg, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Napoli, Amedeo" sort="Napoli, Amedeo" uniqKey="Napoli A" first="Amedeo" last="Napoli">Amedeo Napoli</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Devignes, Marie Dominique" sort="Devignes, Marie Dominique" uniqKey="Devignes M" first="Marie-Dominique" last="Devignes">Marie-Dominique Devignes</name>
<affiliation>
<nlm:aff id="I1">LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The Gene Ontology (GO) is a well-known controlled vocabulary describing the
<italic>biological process</italic>
,
<italic>molecular function </italic>
and
<italic>cellular component </italic>
aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (
<italic>i.e</italic>
. their evidence codes).</p>
</sec>
<sec>
<title>Results</title>
<p>We present here a new semantic similarity measure called
<italic>IntelliGO </italic>
which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The
<italic>IntelliGO </italic>
similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO
<italic>biological process </italic>
and
<italic>molecular function </italic>
terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the
<italic>IntelliGO </italic>
similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the
<italic>IntelliGO </italic>
similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The
<italic>IntelliGO </italic>
similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering.</p>
</sec>
<sec>
<title>Availability</title>
<p>An on-line version of the
<italic>IntelliGO </italic>
similarity measure is available at:
<ext-link ext-link-type="uri" xlink:href="http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/">http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/</ext-link>
</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Ashburner, M" uniqKey="Ashburner M">M Ashburner</name>
</author>
<author>
<name sortKey="Ball, C" uniqKey="Ball C">C Ball</name>
</author>
<author>
<name sortKey="Blake, J" uniqKey="Blake J">J Blake</name>
</author>
<author>
<name sortKey="Botstein, D" uniqKey="Botstein D">D Botstein</name>
</author>
<author>
<name sortKey="Butler, H" uniqKey="Butler H">H Butler</name>
</author>
<author>
<name sortKey="Cherry, M" uniqKey="Cherry M">M Cherry</name>
</author>
<author>
<name sortKey="Davis, A" uniqKey="Davis A">A Davis</name>
</author>
<author>
<name sortKey="Dolinski, K" uniqKey="Dolinski K">K Dolinski</name>
</author>
<author>
<name sortKey="Dwight, S" uniqKey="Dwight S">S Dwight</name>
</author>
<author>
<name sortKey="Eppig, J" uniqKey="Eppig J">J Eppig</name>
</author>
<author>
<name sortKey="Harris, M" uniqKey="Harris M">M Harris</name>
</author>
<author>
<name sortKey="Hill, D" uniqKey="Hill D">D Hill</name>
</author>
<author>
<name sortKey="Issel Tarver, L" uniqKey="Issel Tarver L">L Issel-Tarver</name>
</author>
<author>
<name sortKey="Kasarskis, A" uniqKey="Kasarskis A">A Kasarskis</name>
</author>
<author>
<name sortKey="Lewis, S" uniqKey="Lewis S">S Lewis</name>
</author>
<author>
<name sortKey="Matese, Jc" uniqKey="Matese J">JC Matese</name>
</author>
<author>
<name sortKey="Richardson, J" uniqKey="Richardson J">J Richardson</name>
</author>
<author>
<name sortKey="Ringwald, M" uniqKey="Ringwald M">M Ringwald</name>
</author>
<author>
<name sortKey="Rubin, G" uniqKey="Rubin G">G Rubin</name>
</author>
<author>
<name sortKey="Sherlock, G" uniqKey="Sherlock G">G Sherlock</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lord, Pw" uniqKey="Lord P">PW Lord</name>
</author>
<author>
<name sortKey="Stevens, Rd" uniqKey="Stevens R">RD Stevens</name>
</author>
<author>
<name sortKey="Brass, A" uniqKey="Brass A">A Brass</name>
</author>
<author>
<name sortKey="Goble, Ca" uniqKey="Goble C">CA Goble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Consortium, Tgo" uniqKey="Consortium T">TGO Consortium</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Barrell, D" uniqKey="Barrell D">D Barrell</name>
</author>
<author>
<name sortKey="Dimmer, E" uniqKey="Dimmer E">E Dimmer</name>
</author>
<author>
<name sortKey="Huntley, Rp" uniqKey="Huntley R">RP Huntley</name>
</author>
<author>
<name sortKey="Binns, D" uniqKey="Binns D">D Binns</name>
</author>
<author>
<name sortKey="O Donovan, C" uniqKey="O Donovan C">C O'Donovan</name>
</author>
<author>
<name sortKey="Apweiler, R" uniqKey="Apweiler R">R Apweiler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Khatri, P" uniqKey="Khatri P">P Khatri</name>
</author>
<author>
<name sortKey="Draghici, S" uniqKey="Draghici S">S Draghici</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huang, D" uniqKey="Huang D">D Huang</name>
</author>
<author>
<name sortKey="Sherman, B" uniqKey="Sherman B">B Sherman</name>
</author>
<author>
<name sortKey="Tan, Q" uniqKey="Tan Q">Q Tan</name>
</author>
<author>
<name sortKey="Collins, J" uniqKey="Collins J">J Collins</name>
</author>
<author>
<name sortKey="Alvord, Wg" uniqKey="Alvord W">WG Alvord</name>
</author>
<author>
<name sortKey="Roayaei, J" uniqKey="Roayaei J">J Roayaei</name>
</author>
<author>
<name sortKey="Stephens, R" uniqKey="Stephens R">R Stephens</name>
</author>
<author>
<name sortKey="Baseler, M" uniqKey="Baseler M">M Baseler</name>
</author>
<author>
<name sortKey="Lane, Hc" uniqKey="Lane H">HC Lane</name>
</author>
<author>
<name sortKey="Lempicki, R" uniqKey="Lempicki R">R Lempicki</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Beissbarth, T" uniqKey="Beissbarth T">T Beissbarth</name>
</author>
<author>
<name sortKey="Speed, Tp" uniqKey="Speed T">TP Speed</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Speer, N" uniqKey="Speer N">N Speer</name>
</author>
<author>
<name sortKey="Spieth, C" uniqKey="Spieth C">C Spieth</name>
</author>
<author>
<name sortKey="Zell, A" uniqKey="Zell A">A Zell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pesquita, C" uniqKey="Pesquita C">C Pesquita</name>
</author>
<author>
<name sortKey="Faria, D" uniqKey="Faria D">D Faria</name>
</author>
<author>
<name sortKey="Falcao, Ao" uniqKey="Falcao A">AO Falcão</name>
</author>
<author>
<name sortKey="Lord, P" uniqKey="Lord P">P Lord</name>
</author>
<author>
<name sortKey="Couto, Fm" uniqKey="Couto F">FM Couto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rogers, Mf" uniqKey="Rogers M">MF Rogers</name>
</author>
<author>
<name sortKey="Ben Hur, A" uniqKey="Ben Hur A">A Ben-Hur</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Du, Z" uniqKey="Du Z">Z Du</name>
</author>
<author>
<name sortKey="Li, L" uniqKey="Li L">L Li</name>
</author>
<author>
<name sortKey="Chen, Cf" uniqKey="Chen C">CF Chen</name>
</author>
<author>
<name sortKey="Yu, Ps" uniqKey="Yu P">PS Yu</name>
</author>
<author>
<name sortKey="Wang, Jz" uniqKey="Wang J">JZ Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Popescu, M" uniqKey="Popescu M">M Popescu</name>
</author>
<author>
<name sortKey="Keller, Jm" uniqKey="Keller J">JM Keller</name>
</author>
<author>
<name sortKey="Mitchell, Ja" uniqKey="Mitchell J">JA Mitchell</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ganesan, P" uniqKey="Ganesan P">P Ganesan</name>
</author>
<author>
<name sortKey="Garcia Molina, H" uniqKey="Garcia Molina H">H Garcia-Molina</name>
</author>
<author>
<name sortKey="Widom, J" uniqKey="Widom J">J Widom</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blanchard, E" uniqKey="Blanchard E">E Blanchard</name>
</author>
<author>
<name sortKey="Harzallah, M" uniqKey="Harzallah M">M Harzallah</name>
</author>
<author>
<name sortKey="Kuntz, P" uniqKey="Kuntz P">P Kuntz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tversky, A" uniqKey="Tversky A">A Tversky</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, Wn" uniqKey="Lee W">WN Lee</name>
</author>
<author>
<name sortKey="Shah, N" uniqKey="Shah N">N Shah</name>
</author>
<author>
<name sortKey="Sundlass, K" uniqKey="Sundlass K">K Sundlass</name>
</author>
<author>
<name sortKey="Musen, M" uniqKey="Musen M">M Musen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jiang, Jj" uniqKey="Jiang J">JJ Jiang</name>
</author>
<author>
<name sortKey="Conrath, Dw" uniqKey="Conrath D">DW Conrath</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Ga" uniqKey="Miller G">GA Miller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, Z" uniqKey="Wu Z">Z Wu</name>
</author>
<author>
<name sortKey="Palmer, M" uniqKey="Palmer M">M Palmer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lin, D" uniqKey="Lin D">D Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sevilla, Jl" uniqKey="Sevilla J">JL Sevilla</name>
</author>
<author>
<name sortKey="Segura, V" uniqKey="Segura V">V Segura</name>
</author>
<author>
<name sortKey="Podhorski, A" uniqKey="Podhorski A">A Podhorski</name>
</author>
<author>
<name sortKey="Guruceaga, E" uniqKey="Guruceaga E">E Guruceaga</name>
</author>
<author>
<name sortKey="Mato, Jm" uniqKey="Mato J">JM Mato</name>
</author>
<author>
<name sortKey="Martinez Cruz, La" uniqKey="Martinez Cruz L">LA Martinez-Cruz</name>
</author>
<author>
<name sortKey="Corrales, Fj" uniqKey="Corrales F">FJ Corrales</name>
</author>
<author>
<name sortKey="Rubio, A" uniqKey="Rubio A">A Rubio</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brameier, M" uniqKey="Brameier M">M Brameier</name>
</author>
<author>
<name sortKey="Wiuf, C" uniqKey="Wiuf C">C Wiuf</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rada, R" uniqKey="Rada R">R Rada</name>
</author>
<author>
<name sortKey="Mili, H" uniqKey="Mili H">H Mili</name>
</author>
<author>
<name sortKey="Bicknell, E" uniqKey="Bicknell E">E Bicknell</name>
</author>
<author>
<name sortKey="Blettner, M" uniqKey="Blettner M">M Blettner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nagar, A" uniqKey="Nagar A">A Nagar</name>
</author>
<author>
<name sortKey="Al Mubaid, H" uniqKey="Al Mubaid H">H Al-Mubaid</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Floridi, L" uniqKey="Floridi L">L Floridi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schlicker, A" uniqKey="Schlicker A">A Schlicker</name>
</author>
<author>
<name sortKey="Domingues, F" uniqKey="Domingues F">F Domingues</name>
</author>
<author>
<name sortKey="Rahnenfuhrer, J" uniqKey="Rahnenfuhrer J">J Rahnenfuhrer</name>
</author>
<author>
<name sortKey="Lengauer, T" uniqKey="Lengauer T">T Lengauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Jz" uniqKey="Wang J">JZ Wang</name>
</author>
<author>
<name sortKey="Du, Z" uniqKey="Du Z">Z Du</name>
</author>
<author>
<name sortKey="Payattakool, R" uniqKey="Payattakool R">R Payattakool</name>
</author>
<author>
<name sortKey="Yu, Ps" uniqKey="Yu P">PS Yu</name>
</author>
<author>
<name sortKey="Chen, Cf" uniqKey="Chen C">CF Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Othman, Rm" uniqKey="Othman R">RM Othman</name>
</author>
<author>
<name sortKey="Deris, S" uniqKey="Deris S">S Deris</name>
</author>
<author>
<name sortKey="Illias, Rm" uniqKey="Illias R">RM Illias</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nagar, A" uniqKey="Nagar A">A Nagar</name>
</author>
<author>
<name sortKey="Al Mubaid, H" uniqKey="Al Mubaid H">H Al-Mubaid</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Martin, D" uniqKey="Martin D">D Martin</name>
</author>
<author>
<name sortKey="Brun, C" uniqKey="Brun C">C Brun</name>
</author>
<author>
<name sortKey="Remy, E" uniqKey="Remy E">E Remy</name>
</author>
<author>
<name sortKey="Mouren, P" uniqKey="Mouren P">P Mouren</name>
</author>
<author>
<name sortKey="Thieffry, D" uniqKey="Thieffry D">D Thieffry</name>
</author>
<author>
<name sortKey="Jacq, B" uniqKey="Jacq B">B Jacq</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mistry, M" uniqKey="Mistry M">M Mistry</name>
</author>
<author>
<name sortKey="Pavlidis, P" uniqKey="Pavlidis P">P Pavlidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Guo, X" uniqKey="Guo X">X Guo</name>
</author>
<author>
<name sortKey="Liu, R" uniqKey="Liu R">R Liu</name>
</author>
<author>
<name sortKey="Shriver, Cd" uniqKey="Shriver C">CD Shriver</name>
</author>
<author>
<name sortKey="Hu, H" uniqKey="Hu H">H Hu</name>
</author>
<author>
<name sortKey="Liebman, Mn" uniqKey="Liebman M">MN Liebman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pesquita, C" uniqKey="Pesquita C">C Pesquita</name>
</author>
<author>
<name sortKey="Faria, D" uniqKey="Faria D">D Faria</name>
</author>
<author>
<name sortKey="Bastos, H" uniqKey="Bastos H">H Bastos</name>
</author>
<author>
<name sortKey="Ferreira, A" uniqKey="Ferreira A">A Ferreira</name>
</author>
<author>
<name sortKey="Falcao, Ao" uniqKey="Falcao A">AO Falcão</name>
</author>
<author>
<name sortKey="Couto, F" uniqKey="Couto F">F Couto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Salton, G" uniqKey="Salton G">G Salton</name>
</author>
<author>
<name sortKey="Mcgill, Mj" uniqKey="Mcgill M">MJ McGill</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Polettini, N" uniqKey="Polettini N">N Polettini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bodenreider, O" uniqKey="Bodenreider O">O Bodenreider</name>
</author>
<author>
<name sortKey="Aubry, M" uniqKey="Aubry M">M Aubry</name>
</author>
<author>
<name sortKey="Burgun, A" uniqKey="Burgun A">A Burgun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Glenisson, P" uniqKey="Glenisson P">P Glenisson</name>
</author>
<author>
<name sortKey="Antal, P" uniqKey="Antal P">P Antal</name>
</author>
<author>
<name sortKey="Mathys, J" uniqKey="Mathys J">J Mathys</name>
</author>
<author>
<name sortKey="Moreau, Y" uniqKey="Moreau Y">Y Moreau</name>
</author>
<author>
<name sortKey="Moor, Bd" uniqKey="Moor B">BD Moor</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chabalier, J" uniqKey="Chabalier J">J Chabalier</name>
</author>
<author>
<name sortKey="Mosser, J" uniqKey="Mosser J">J Mosser</name>
</author>
<author>
<name sortKey="Burgun, A" uniqKey="Burgun A">A Burgun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wright, Cc" uniqKey="Wright C">CC Wright</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blott, S" uniqKey="Blott S">S Blott</name>
</author>
<author>
<name sortKey="Camous, F" uniqKey="Camous F">F Camous</name>
</author>
<author>
<name sortKey="Gurrin, C" uniqKey="Gurrin C">C Gurrin</name>
</author>
<author>
<name sortKey="Jones, Gjf" uniqKey="Jones G">GJF Jones</name>
</author>
<author>
<name sortKey="Smeaton, Af" uniqKey="Smeaton A">AF Smeaton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Couto, Fm" uniqKey="Couto F">FM Couto</name>
</author>
<author>
<name sortKey="Silva, Mj" uniqKey="Silva M">MJ Silva</name>
</author>
<author>
<name sortKey="Coutinho, Pm" uniqKey="Coutinho P">PM Coutinho</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Catia" uniqKey="Catia">Catia</name>
</author>
<author>
<name sortKey="Pessoa, D" uniqKey="Pessoa D">D Pessoa</name>
</author>
<author>
<name sortKey="Faria, D" uniqKey="Faria D">D Faria</name>
</author>
<author>
<name sortKey="Couto, F" uniqKey="Couto F">F Couto</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benabderrahmane, S" uniqKey="Benabderrahmane S">S Benabderrahmane</name>
</author>
<author>
<name sortKey="Devignes, Md" uniqKey="Devignes M">MD Devignes</name>
</author>
<author>
<name sortKey="Smail Tabbone, M" uniqKey="Smail Tabbone M">M Smaïl Tabbone</name>
</author>
<author>
<name sortKey="Poch, O" uniqKey="Poch O">O Poch</name>
</author>
<author>
<name sortKey="Napoli, A" uniqKey="Napoli A">A Napoli</name>
</author>
<author>
<name sortKey="Nguyen N H, N" uniqKey="Nguyen N H N">N Nguyen N-H</name>
</author>
<author>
<name sortKey="Raffelsberger, W" uniqKey="Raffelsberger W">W Raffelsberger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Carbon, S" uniqKey="Carbon S">S Carbon</name>
</author>
<author>
<name sortKey="Ireland, A" uniqKey="Ireland A">A Ireland</name>
</author>
<author>
<name sortKey="Mungall, Cj" uniqKey="Mungall C">CJ Mungall</name>
</author>
<author>
<name sortKey="Shu, S" uniqKey="Shu S">S Shu</name>
</author>
<author>
<name sortKey="Marshall, B" uniqKey="Marshall B">B Marshall</name>
</author>
<author>
<name sortKey="Lewis, S" uniqKey="Lewis S">S Lewis</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ovaska, K" uniqKey="Ovaska K">K Ovaska</name>
</author>
<author>
<name sortKey="Laakso, M" uniqKey="Laakso M">M Laakso</name>
</author>
<author>
<name sortKey="Hautaniemi, S" uniqKey="Hautaniemi S">S Hautaniemi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21122125</article-id>
<article-id pub-id-type="pmc">3098105</article-id>
<article-id pub-id-type="publisher-id">1471-2105-11-588</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-11-588</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>IntelliGO: a new vector-based semantic similarity measure including annotation origin</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>Benabderrahmane</surname>
<given-names>Sidahmed</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>benabdsi@loria.fr</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Smail-Tabbone</surname>
<given-names>Malika</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>malika@loria.fr</email>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Poch</surname>
<given-names>Olivier</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
<email>poch@titus.u-strasbg.fr</email>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>Napoli</surname>
<given-names>Amedeo</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>napoli@loria.fr</email>
</contrib>
<contrib contrib-type="author" id="A5">
<name>
<surname>Devignes</surname>
<given-names>Marie-Dominique</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>devignes@loria.fr</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
LORIA (CNRS, INRIA, Nancy-Université), Équipe Orpailleur, Bâtiment B, Campus scientifique, 54506 Vandoeuvre-lès-Nancy Cedex, France</aff>
<aff id="I2">
<label>2</label>
L.B.G.I., CNRS UMR7104, IGBMC, 1 rue Laurent Fries, 67404 Illkirch Strasbourg, France</aff>
<pub-date pub-type="collection">
<year>2010</year>
</pub-date>
<pub-date pub-type="epub">
<day>1</day>
<month>12</month>
<year>2010</year>
</pub-date>
<volume>11</volume>
<fpage>588</fpage>
<lpage>588</lpage>
<history>
<date date-type="received">
<day>19</day>
<month>5</month>
<year>2010</year>
</date>
<date date-type="accepted">
<day>1</day>
<month>12</month>
<year>2010</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright ©2010 Benabderrahmane et al; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2010</copyright-year>
<copyright-holder>Benabderrahmane et al; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/11/588"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>The Gene Ontology (GO) is a well-known controlled vocabulary describing the
<italic>biological process</italic>
,
<italic>molecular function </italic>
and
<italic>cellular component </italic>
aspects of gene annotation. It has become a widely used knowledge source in bioinformatics for annotating genes and measuring their semantic similarity. These measures generally involve the GO graph structure, the information content of GO aspects, or a combination of both. However, only a few of the semantic similarity measures described so far can handle GO annotations differently according to their origin (
<italic>i.e</italic>
. their evidence codes).</p>
</sec>
<sec>
<title>Results</title>
<p>We present here a new semantic similarity measure called
<italic>IntelliGO </italic>
which integrates several complementary properties in a novel vector space model. The coefficients associated with each GO term that annotates a given gene or protein include its information content as well as a customized value for each type of GO evidence code. The generalized cosine similarity measure, used for calculating the dot product between two vectors, has been rigorously adapted to the context of the GO graph. The
<italic>IntelliGO </italic>
similarity measure is tested on two benchmark datasets consisting of KEGG pathways and Pfam domains grouped as clans, considering the GO
<italic>biological process </italic>
and
<italic>molecular function </italic>
terms, respectively, for a total of 683 yeast and human genes and involving more than 67,900 pair-wise comparisons. The ability of the
<italic>IntelliGO </italic>
similarity measure to express the biological cohesion of sets of genes compares favourably to four existing similarity measures. For inter-set comparison, it consistently discriminates between distinct sets of genes. Furthermore, the
<italic>IntelliGO </italic>
similarity measure allows the influence of weights assigned to evidence codes to be checked. Finally, the results obtained with a complementary reference technique give intermediate but correct correlation values with the sequence similarity, Pfam, and Enzyme classifications when compared to previously published measures.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>The
<italic>IntelliGO </italic>
similarity measure provides a customizable and comprehensive method for quantifying gene similarity based on GO annotations. It also displays a robust set-discriminating power which suggests it will be useful for functional clustering.</p>
</sec>
<sec>
<title>Availability</title>
<p>An on-line version of the
<italic>IntelliGO </italic>
similarity measure is available at:
<ext-link ext-link-type="uri" xlink:href="http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/">http://bioinfo.loria.fr/Members/benabdsi/intelligo_project/</ext-link>
</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>1 Background</title>
<sec>
<title>1.1 Gene annotation</title>
<p>The Gene Ontology (GO) has become one of the most important and useful resources in bioinformatics [
<xref ref-type="bibr" rid="B1">1</xref>
]. This ontology of about 30,000 terms is organized as a controlled vocabulary describing the
<italic>biological process </italic>
(BP),
<italic>molecular function </italic>
(MF), and
<italic>cellular component </italic>
(CC) aspects of gene annotation, also called GO aspects [
<xref ref-type="bibr" rid="B2">2</xref>
]. The GO vocabulary is structured as a rooted Directed Acyclic Graph (rDAG) in which GO terms are the nodes connected by different hierarchical relations (mostly
<italic>is_a </italic>
and
<italic>part_of </italic>
relations). The
<italic>is_a </italic>
relation describes the fact that a given child term is a specialization of a parent term, while the
<italic>part_of </italic>
relation denotes the fact that a child term is a component of a parent term. Another GO relation
<italic>regulates </italic>
expresses the fact that one process directly affects the manifestation of another process or quality [
<xref ref-type="bibr" rid="B3">3</xref>
]. However, this relation is not considered in most studies dealing with semantic similarity measures. By definition, each rDAG has a unique root node, relationships between nodes are oriented, and there are no cycles,
<italic>i.e</italic>
. no path starts and ends at the same node.</p>
<p>The GO Consortium regularly updates a GO Annotation (GOA) Database [
<xref ref-type="bibr" rid="B4">4</xref>
] in which appropriate GO terms are assigned to genes or gene products from public databases. GO annotations are widely used for data mining in several bioinformatics domains, including gene functional analysis of DNA microarray data [
<xref ref-type="bibr" rid="B5">5</xref>
], gene clustering [
<xref ref-type="bibr" rid="B6">6</xref>
-
<xref ref-type="bibr" rid="B8">8</xref>
], and semantic gene similarity [
<xref ref-type="bibr" rid="B9">9</xref>
].</p>
<p>It is worth noting that each GO annotation is summarized by an evidence code (EC) which traces the procedure that was used to assign a specific GO term to a given gene [
<xref ref-type="bibr" rid="B10">10</xref>
]. Out of all available ECs, only the Inferred from Electronic Annotation (IEA) code is not assigned by a curator. Manually assigned ECs fall into four general categories (see Section 2.4.3 and Table
<xref ref-type="table" rid="T1">1</xref>
): author statement, experimental analysis, computational analysis, and curatorial statements. An author statement (Auth) annotation either cites a published reference as the source of information (TAS for Traceable Author Statement) or does not cite a published reference (NAS for Non-traceable Author Statement). An experimental (Exp) annotation is based on a laboratory experiment. There are five ECs which correspond to various specific types of experimental evidence (IDA, IPI, IMP, IGI, and IEP; see Table
<xref ref-type="table" rid="T1">1</xref>
for details), plus one non-specific parent code which is simply denoted as Exp. The use of an Exp EC annotation is always accompanied by the citation of a published reference. A computational (Comp) annotation is based on computational analysis performed under the supervision of a human annotator. There are six types of Comp EC which correspond to various specific computational analyses (ISS, RCA, ISA, ISO, ISM and IGC; see Table
<xref ref-type="table" rid="T1">1</xref>
for details). The curatorial statement (Cur) includes the IC (Inferred by Curator) code which is used when an annotation is not supported by any direct evidence but can be reasonably inferred by a curator from other GO annotations for which evidence is available. For example, if a gene product has been annotated as a transcription factor on some experimental basis, the curator may add an IC annotation to the cellular component term
<italic>nucleus</italic>
. The ND (No biological Data available) code also belongs to the Cur category and means that a curator could not find any biological information. In practice, annotators are asked to follow a detailed decision tree in order to qualify each annotation with the proper EC [
<xref ref-type="bibr" rid="B11">11</xref>
]. Finally, a single reference can describe multiple methods, each of which provides evidence to assign a certain GO term to a particular gene product. It is therefore common to see multiple gene annotations with identical GO identifiers but different ECs.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>EC weight lists assigned to the 16 GO ECs considered in this study</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th></th>
<th align="center" colspan="2">Auth</th>
<th align="center" colspan="6">Exp</th>
<th align="center" colspan="6">Comp</th>
<th align="center" colspan="2">Cur</th>
<th align="center">Auto</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">
<bold>EC</bold>
</td>
<td align="center">
<bold>TAS</bold>
</td>
<td align="center">
<bold>NAS</bold>
</td>
<td align="center">
<bold>EXP</bold>
</td>
<td align="center">
<bold>IDA</bold>
</td>
<td align="center">
<bold>IPI</bold>
</td>
<td align="center">
<bold>IMP</bold>
</td>
<td align="center">
<bold>IGI</bold>
</td>
<td align="center">
<bold>IEP</bold>
</td>
<td align="center">
<bold>ISS</bold>
</td>
<td align="center">
<bold>RCA</bold>
</td>
<td align="center">
<bold>ISA</bold>
</td>
<td align="center">
<bold>ISO</bold>
</td>
<td align="center">
<bold>ISM</bold>
</td>
<td align="center">
<bold>IGC</bold>
</td>
<td align="center">
<bold>IC</bold>
</td>
<td align="center">
<bold>ND</bold>
</td>
<td align="center">
<bold>IEA</bold>
</td>
</tr>
<tr>
<td colspan="18">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">List1</td>
<td align="center" colspan="17">1</td>
</tr>
<tr>
<td colspan="18">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">List2</td>
<td align="center">1</td>
<td align="center">0.5</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.5</td>
<td align="center">0</td>
<td align="center">0.4</td>
</tr>
<tr>
<td colspan="18">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">List3</td>
<td align="center">1</td>
<td align="center">0.5</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.8</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.6</td>
<td align="center">0.5</td>
<td align="center">0</td>
<td align="center">0</td>
</tr>
<tr>
<td colspan="18">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">List4</td>
<td align="center" colspan="16">0</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Table 1: The various weights assigned to the ECs are listed in the following lines as EC weight lists 1 to 4. TAS: Traceable Author Statement; NAS: Non-traceable Author Statement; EXP: Inferred from Experiment; IDA: Inferred from Direct Assay; IPI: Inferred from Physical Interaction; IMP: Inferred from Mutant Phenotype; IGI: Inferred from Genetic Interaction; IEP: Inferred from Expression Pattern; ISS: Inferred from Sequence Similarity; RCA: Inferred from Reviewed Computational Analysis; ISA: Inferred from Sequence Alignment; ISO: Inferred from Sequence Orthology; ISM: Inferred from Sequence Model; IGC: Inferred from Genomic Context; IC: Inferred by Curator; IEA: Inferred from Electronic Annotation; ND: No biological Data available. The EC categories are indicated in the first line of the table. Auth: Author statement; Exp: Experimental; Comp: Computational Analysis; Cur: Curator statement; Auto: Automatically assigned.</p>
</table-wrap-foot>
</table-wrap>
<p>The statistical distribution of gene annotations with respect to the various ECs is shown in Figure
<xref ref-type="fig" rid="F1">1</xref>
for human and yeast BP and MF aspects. This figure shows that IEA annotations are clearly dominant in both species and for all GO aspects, but that some codes are not represented at all (e.g. ISM, IGC).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>Distribution of EC (evidence codes) in yeast and human gene annotations according to BP and MF aspects</bold>
. The number of annotations assigned to a gene with a given EC is represented for each EC. Note that some genes can be annotated twice with the same term but with a different EC. The cumulative numbers of all non-IEA annotations are 18,496 and 9,564 for the yeast BP and MF annotations, respectively, and 21,462 and 16,243 for the human BP and MF annotations, respectively. Statistics are derived from the NCBI annotation file, version June 2009.</p>
</caption>
<graphic xlink:href="1471-2105-11-588-1"></graphic>
</fig>
<p>However, the ratio between non-IEA and IEA annotations differs between yeast and human. It is about 2.0 and 0.8 for the yeast BP and MF annotations compared to about 0.8 and 0.6 for the corresponding human annotations, respectively. This observation reflects a higher contribution of non-IEA annotation in yeast and is somewhat expected because of the smaller size of the yeast genome and because more experiments have been carried out on yeast. In summary, GO ECs add high value to gene annotations because they trace annotation origins. However, apart from the G-SESAME and
<italic>SimGIC </italic>
measures which select GO annotations on the basis of ECs [
<xref ref-type="bibr" rid="B12">12</xref>
], only a few of the gene similarity measures described so far can handle GO annotations differently according to their ECs [
<xref ref-type="bibr" rid="B9">9</xref>
], [
<xref ref-type="bibr" rid="B13">13</xref>
]. Hence, one objective of this paper is to introduce a new semantic similarity measure which takes into account GO annotations and their associated ECs.</p>
</sec>
<sec>
<title>1.2 Semantic similarity measure</title>
<sec>
<title>1.2.1 The notion of semantic similarity measure</title>
<p>The general notion of similarity, used to identify objects which share common attributes or characteristics, appears in many contexts such as word sense disambiguation, spelling correction, and information retrieval [
<xref ref-type="bibr" rid="B14">14</xref>
,
<xref ref-type="bibr" rid="B15">15</xref>
]. Similarity methods based on this notion are often called
<italic>featural approaches </italic>
because they assume that items are represented by lists of features which describe their properties. Thus, a similarity comparison involves comparing the feature lists that represent the items [
<xref ref-type="bibr" rid="B16">16</xref>
].</p>
<p>A similarity measure is referred to as
<italic>semantic </italic>
if it can handle the relationships that exist between the features of the items being compared. Comparing documents described by terms from a thesaurus or an ontology typically involves measuring semantic similarity [
<xref ref-type="bibr" rid="B17">17</xref>
]. Authors such as Resnik [
<xref ref-type="bibr" rid="B18">18</xref>
] or Jiang and Conrath [
<xref ref-type="bibr" rid="B19">19</xref>
] are considered pioneers in ontology-based semantic similarity measures thanks to their extensive investigations in general English linguistics [
<xref ref-type="bibr" rid="B20">20</xref>
]. A general framework for comparing semantic similarity measures in a subsumption hierarchy has been proposed by Blanchard
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B15">15</xref>
]. For these authors,
<italic>tree-based similarities </italic>
fall into two large categories, namely those which only depend on the hierarchical relationships between the terms [
<xref ref-type="bibr" rid="B21">21</xref>
] and those which incorporate additional statistics such as term frequency in a corpus [
<xref ref-type="bibr" rid="B22">22</xref>
].</p>
<p>In the biological domain, the term
<italic>functional similarity </italic>
was introduced to describe the similarity between genes or gene products as measured by the similarity between their GO functional annotation terms. Biologists often need to establish functional similarities between genes. For example, in gene expression studies, correlations have been demonstrated between gene expression and GO semantic similarities [
<xref ref-type="bibr" rid="B23">23</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
]. Because GO terms are organized in an rDAG, the functional similarity between genes can be calculated using a semantic similarity measure. In a recent review, Pesquita
<italic>et al</italic>
. define a semantic similarity measure as a function that, given two individual ontology terms or two sets of terms annotating two biological entities, returns a numerical value reflecting the closeness in meaning between them [
<xref ref-type="bibr" rid="B9">9</xref>
]. These authors distinguish the comparison between two ontology terms from the comparison between two sets of ontology terms.</p>
</sec>
<sec>
<title>1.2.2 Comparison between two terms</title>
<p>Concerning the comparison between individual ontology terms, the two types of approaches reviewed by Pesquita
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B9">9</xref>
] are similar to those proposed by Blanchard
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B15">15</xref>
], namely the
<italic>edge-based </italic>
measures which rely on counting edges in the graph, and
<italic>node-based </italic>
measures which exploit information contained in the considered term, its descendants and its parents.</p>
<p>In most
<italic>edge-based </italic>
measures, the
<italic>Shortest Path-Length </italic>
(SPL) is used as a distance measure between two terms in a graph. This indicator was used by Rada
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B25">25</xref>
] on MeSH (Medical Subject Headings) terms and by Al-Mubaid
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B26">26</xref>
] on GO terms. However, Pesquita
<italic>et al</italic>
. question whether SPL-based measures truly reflect the semantic closeness of two terms. Indeed these measures rely on two assumptions that are seldom true in biological ontologies, namely that nodes and edges are uniformly distributed, and that edges at the same level in a hierarchy correspond to the same semantic distance between terms.
<italic>Node-based </italic>
measures are probably the most cited semantic similarity measures. These mainly rely on the information content (IC) of the two terms being compared and of their closest common ancestor [
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B22">22</xref>
]. The information content of a term is based on its frequency, or probability, of occurring in a corpus. Resnik uses the negative logarithm of the probability of a term to quantify its information content,
<italic>IC</italic>
(
<italic>c
<sub>i</sub>
</italic>
) =
<italic>-Log</italic>
(
<italic>p</italic>
(
<italic>c
<sub>i</sub>
</italic>
)) [
<xref ref-type="bibr" rid="B18">18</xref>
,
<xref ref-type="bibr" rid="B27">27</xref>
]. Thus, a term with a high probability of occurring has a low IC. Conversely, very specific terms that are less frequent have a high IC. Intuitively, IC values increase as a function of depth in the hierarchy. Resnik's similarity measure between two terms consists of determining the IC of all common ancestors between two terms and selecting the maximal value,
<italic>i.e</italic>
. the IC of the most specific (i.e. lowest) common ancestor (LCA). In other words, if two terms share an ancestor with a high information content, they are considered to be semantically very similar. Since the maximum of this IC value can be greater than one, Lin introduced a normalization term into Resnik's measure yielding [
<xref ref-type="bibr" rid="B22">22</xref>
]:</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2105-11-588-i1" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
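<p>As an illustration of Equation (1), the following minimal Python sketch computes Resnik's information content and Lin's normalized similarity. The toy ontology, the term names, and the annotation counts are hypothetical and chosen only for illustration; the logarithm base (natural logarithm here) is also an assumption, since it does not affect Lin's ratio.</p>
<preformat>
```python
import math

# Hypothetical toy ontology: each term maps to the list of its parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}

# Hypothetical corpus counts: number of genes annotated by each term
# or by any of its descendants, so the root covers the whole corpus.
COUNTS = {
    "root": 100,
    "metabolism": 40,
    "transport": 60,
    "lipid_metabolism": 10,
    "sugar_metabolism": 20,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors up to the root."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content IC(c) = -log p(c), written as log(N / n)."""
    return math.log(COUNTS["root"] / COUNTS[term])


def resnik(t1, t2):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1).intersection(ancestors(t2))
    return max(ic(t) for t in common)


def lin(t1, t2):
    """Lin's normalization of Resnik's measure, as in Equation (1)."""
    denom = ic(t1) + ic(t2)
    if denom == 0:
        return 1.0  # assumption: both terms are the root itself
    return 2 * resnik(t1, t2) / denom


print(lin("lipid_metabolism", "sugar_metabolism"))
print(lin("lipid_metabolism", "transport"))
```
</preformat>
<p>In this toy example, two sibling terms under a moderately specific ancestor obtain an intermediate Lin similarity, whereas two terms whose only common ancestor is the root obtain a similarity of zero because the root has an IC of zero.</p>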
<p>Recently, Schlicker
<italic>et al</italic>
. improved Lin's measure by using a correction factor based on the probability of occurrence of the
<italic>LCA</italic>
. Indeed, a general ancestor should not contribute too strongly to term comparison [
<xref ref-type="bibr" rid="B28">28</xref>
]. A limitation of
<italic>node-based </italic>
measures is that they cannot explicitly take into account the distance separating terms from their common ancestor [
<xref ref-type="bibr" rid="B9">9</xref>
]. Hybrid methods also exist which combine
<italic>edge-based </italic>
and
<italic>node-based </italic>
methods, such as those developed by Wang
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B29">29</xref>
] and Othman
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B30">30</xref>
].</p>
</sec>
<sec>
<title>1.2.3 Comparison between sets of terms</title>
<p>Concerning the comparison between sets of terms, the approaches reviewed by Pesquita
<italic>et al</italic>
. fall into two broad categories:
<italic>pairwise </italic>
methods which simply combine the semantic similarities between all pairs of terms, and
<italic>groupwise </italic>
methods which consider a set of terms as a mathematical set, a vector, or a graph. The various
<italic>pairwise </italic>
methods differ in the strategies chosen to calculate the pairwise similarity between terms and in how pairwise similarities are combined. These methods have been thoroughly reviewed previously [
<xref ref-type="bibr" rid="B9">9</xref>
]. Hence we concentrate here on two representative examples that we chose for comparison purposes, namely the Lord measure which uses the
<italic>node-based </italic>
Resnik measure in the pairwise comparison step, and the Al-Mubaid measure which uses an
<italic>edge-based </italic>
measure. The study by Lord
<italic>et al</italic>
. in 2003 [
<xref ref-type="bibr" rid="B2">2</xref>
] provides the first description of a semantic similarity measure for GO terms. Semantic similarity between proteins is calculated as the average of all pairwise Resnik similarities between the corresponding GO annotations. In contrast, the measure defined by Al-Mubaid
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B26">26</xref>
], [
<xref ref-type="bibr" rid="B31">31</xref>
] considers the shortest path length (SPL) matrix between all pairs of GO terms that annotate two genes or gene products. It then calculates the average of all SPL values in the matrix, which represents the path length between two gene products. Finally, a transfer function is applied to the average SPL to convert it into a similarity value (see Methods). In
<italic>groupwise </italic>
methods, non-semantic similarity measures co-exist with semantic ones. For example, the early Jaccard and Dice methods of counting the percentage of common terms between two sets are clearly non-semantic [
<xref ref-type="bibr" rid="B15">15</xref>
]. However, in subsequent studies, various authors used sets of GO terms that have been extended with all term ancestors [
<xref ref-type="bibr" rid="B32">32</xref>
], [
<xref ref-type="bibr" rid="B33">33</xref>
].</p>
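<p>The difference between the two families can be sketched in a few lines of Python. The function names below are hypothetical, and <italic>term_sim</italic> stands for any term-to-term similarity supplied by the caller (for instance a Resnik-type measure); the sketch shows the averaging step of a Lord-type pairwise measure next to the non-semantic Jaccard and Dice coefficients, and it deliberately leaves out the transfer function of the Al-Mubaid measure, which is described in Methods.</p>
<preformat>
```python
def jaccard(terms_a, terms_b):
    """Non-semantic: shared terms divided by all distinct terms."""
    a, b = set(terms_a), set(terms_b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def dice(terms_a, terms_b):
    """Non-semantic: twice the shared terms divided by the two set sizes."""
    a, b = set(terms_a), set(terms_b)
    total = len(a) + len(b)
    return 2 * len(a.intersection(b)) / total if total else 0.0


def average_pairwise(terms_a, terms_b, term_sim):
    """Pairwise combination: mean of term_sim over all pairs of terms."""
    pairs = [(s, t) for s in terms_a for t in terms_b]
    if not pairs:
        return 0.0
    return sum(term_sim(s, t) for s, t in pairs) / len(pairs)


def exact_match(s, t):
    """Hypothetical placeholder term similarity: 1 if identical, else 0."""
    return 1.0 if s == t else 0.0


print(jaccard(["a", "b"], ["b", "c"]))
print(dice(["a", "b"], ["b", "c"]))
print(average_pairwise(["a", "b"], ["b"], exact_match))
```
</preformat>
<p>Replacing <italic>exact_match</italic> with a measure that exploits the ontology is what turns the pairwise average into a semantic measure; the Jaccard and Dice coefficients have no such hook unless the term sets are first extended with their ancestors.</p>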
<p>Graph-based similarity measures are currently implemented in the Bioconductor GOstats package [
<xref ref-type="bibr" rid="B34">34</xref>
]. Each protein or gene can be associated with a graph which is induced by taking the most specific GO terms annotating the protein, and by finding all parents of those terms up to the root node. The union-intersection and longest shared path (
<italic>SimUI </italic>
) method can be used to calculate the between-graph similarity, for example. This method was tested by Guo
<italic>et al</italic>
. on human regulatory pathways [
<xref ref-type="bibr" rid="B35">35</xref>
]. Recently, the
<italic>SimGIC </italic>
method was introduced to improve the
<italic>SimUI </italic>
method by weighting terms with their information content [
<xref ref-type="bibr" rid="B36">36</xref>
].</p>
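<p>A minimal Python sketch of these two graph-based measures is given below, assuming the usual definitions: the union-intersection score is the number of terms shared by the two induced graphs divided by the number of terms in their union, and the IC-weighted variant replaces term counts by sums of information content. The toy ontology and the IC values are hypothetical.</p>
<preformat>
```python
# Hypothetical toy ontology: each term maps to the list of its parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}

# Hypothetical information content values for each term.
IC = {
    "root": 0.0,
    "metabolism": 1.0,
    "transport": 0.5,
    "lipid_metabolism": 2.5,
    "sugar_metabolism": 1.5,
}


def induced_terms(annotations):
    """Terms of the graph induced by the annotations and their ancestors."""
    seen = set()
    stack = list(annotations)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def sim_ui(annots_a, annots_b):
    """Union-intersection: shared terms over all terms of both graphs."""
    ga, gb = induced_terms(annots_a), induced_terms(annots_b)
    union = ga.union(gb)
    return len(ga.intersection(gb)) / len(union) if union else 0.0


def sim_gic(annots_a, annots_b):
    """IC-weighted variant: summed IC of shared terms over summed IC of all."""
    ga, gb = induced_terms(annots_a), induced_terms(annots_b)
    union_ic = sum(IC[t] for t in ga.union(gb))
    if union_ic == 0:
        return 0.0  # assumption: no informative term in either graph
    return sum(IC[t] for t in ga.intersection(gb)) / union_ic


print(sim_ui(["lipid_metabolism"], ["sugar_metabolism"]))
print(sim_gic(["lipid_metabolism"], ["sugar_metabolism"]))
```
</preformat>
<p>In this toy example the unweighted score counts the uninformative root as a full shared term, whereas the IC-weighted score discounts it, which illustrates why weighting by information content lowers the similarity of genes that only share general ancestors.</p>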
<p>Finally, vector-based similarity measures need to define an
<italic>annotation Vector-Space Model </italic>
(VSM) by analogy to the classical VSM described for document retrieval [
<xref ref-type="bibr" rid="B37">37</xref>
], [
<xref ref-type="bibr" rid="B38">38</xref>
], [
<xref ref-type="bibr" rid="B39">39</xref>
]. In the
<italic>annotation </italic>
VSM, each gene is represented by a vector
<inline-formula>
<mml:math id="M2" name="1471-2105-11-588-i2" overflow="scroll">
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
in a
<italic>k</italic>
-dimensional space constructed from basis vectors
<inline-formula>
<mml:math id="M3" name="1471-2105-11-588-i3" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
which correspond to the
<italic>k </italic>
annotation terms [
<xref ref-type="bibr" rid="B40">40</xref>
,
<xref ref-type="bibr" rid="B41">41</xref>
]. Thus, text documents and terms are replaced by genes and annotation terms, respectively, according to</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M4" name="1471-2105-11-588-i4" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>∑</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mstyle>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<inline-formula>
<mml:math id="M5" name="1471-2105-11-588-i5" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is the
<italic>i</italic>
-th basis vector in the
<italic>annotation </italic>
VSM corresponding to the annotation term
<italic>t
<sub>i</sub>
</italic>
, and where
<italic>α
<sub>i </sub>
</italic>
is the coefficient of that term.</p>
<p>The DAVID tool, which was developed for functional characterization of gene clusters [
<xref ref-type="bibr" rid="B6">6</xref>
], uses this representation with binary coefficients which are set to 1 if a gene is annotated by a term and zero otherwise. Similarity is then calculated using "Kappa statistics" [
<xref ref-type="bibr" rid="B42">42</xref>
] which consider the significance of observed co-occurrences with respect to chance. However, this approach does not take into account the semantic similarity between functional annotation terms. In another study by Chabalier
<italic>et al</italic>
., the coefficients are defined as weights corresponding to the information content of each annotation term. The similarity between two genes is then computed using a cosine similarity measure. The semantic feature in Chabalier's method consists of a pre-filtering step which retains only those GO annotations at a certain level in the GO graph.</p>
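<p>The two vector-based approaches just described can be sketched as follows. The sketch assumes Cohen's kappa as the form of the kappa statistic applied to binary annotation vectors (the text does not specify the exact variant used by DAVID) and a classical cosine over term weights for the IC-weighted case; all function names are hypothetical.</p>
<preformat>
```python
import math


def cohen_kappa(x, y):
    """Cohen's kappa between two equal-length binary annotation vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    observed = sum(1 for a, b in zip(x, y) if a == b) / n
    px, py = sum(x) / n, sum(y) / n
    # Agreement expected by chance on both the 1s and the 0s.
    expected = px * py + (1 - px) * (1 - py)
    if expected == 1:
        return 1.0  # assumption: both vectors are constant and identical
    return (observed - expected) / (1 - expected)


def cosine(u, v):
    """Classical cosine between sparse vectors given as term: weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
print(cosine({"t1": 2.0}, {"t2": 2.0}))
```
</preformat>
<p>The second call returns zero although the two genes might be annotated by closely related terms: with an orthogonal basis, only identical terms contribute to the dot product.</p>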
<p>Ganesan
<italic>et al</italic>
. introduced a new vector-based semantic similarity measure in the domain of information retrieval [
<xref ref-type="bibr" rid="B14">14</xref>
]. When two annotation terms are different, this extended cosine measure allows the dot product between their corresponding vectors to be non-zero, thus expressing the semantic similarity that may exist between them. In other words, the components of the vector space are not mutually orthogonal. We decided to use this approach in the context of GO annotations. Hence the
<italic>IntelliGO </italic>
similarity measure defines a new vector-based representation of gene annotations with meaningful coefficients based on both information content and annotation origin. Vector comparison is based on the extended cosine measure and involves an
<italic>edge-based </italic>
similarity measure between each vector component.</p>
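<p>The effect of a non-orthogonal basis can be sketched in Python as follows. By bilinearity, the dot product between two annotation vectors becomes a double sum over all pairs of terms, each pair being weighted by the dot product of the corresponding basis vectors. The table of basis dot products below is a hypothetical placeholder, not the edge-based measure defined by IntelliGO, and the term names are illustrative.</p>
<preformat>
```python
import math

# Hypothetical dot products between distinct basis vectors (symmetric).
# Pairs that are absent from the table are treated as orthogonal.
BASIS_DOT = {
    frozenset(["lipid_metabolism", "sugar_metabolism"]): 0.5,
}


def basis_dot(s, t):
    """Dot product between the basis vectors of terms s and t."""
    if s == t:
        return 1.0  # basis vectors are assumed to have unit norm
    return BASIS_DOT.get(frozenset([s, t]), 0.0)


def dot(u, v):
    """Bilinear dot product between two term: coefficient dicts."""
    return sum(
        a * b * basis_dot(s, t)
        for s, a in u.items()
        for t, b in v.items()
    )


def generalized_cosine(u, v):
    """Cosine in which different terms may have a non-zero dot product."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


gene_a = {"lipid_metabolism": 1.0}
gene_b = {"sugar_metabolism": 1.0}
print(generalized_cosine(gene_a, gene_b))
```
</preformat>
<p>Here two genes annotated by different but related terms obtain a similarity of 0.5, where a classical cosine would return zero. Note that the table of basis dot products must define a positive semi-definite form for the norms to be real numbers, which is an additional assumption of this sketch.</p>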
</sec>
</sec>
</sec>
<sec>
<title>2 Results</title>
<sec>
<title>2.1 The IntelliGO Vector Space Model to represent gene annotations</title>
<sec>
<title>2.1.1 The IntelliGO weighting scheme</title>
<p>The first original feature of the
<italic>IntelliGO </italic>
VSM lies in its weighting scheme. The coefficients assigned to each vector component (GO term) are composed of two measures analogous to the
<italic>tf-idf </italic>
measures used for document retrieval [
<xref ref-type="bibr" rid="B43">43</xref>
]. On the one hand, a weight
<italic>w</italic>
(
<italic>g</italic>
,
<italic>t
<sub>i</sub>
</italic>
) is assigned to the EC that traces the annotation origin and qualifies the importance of the association between a specific GO term
<italic>t
<sub>i </sub>
</italic>
and a given gene
<italic>g</italic>
. On the other hand, the
<italic>Inverse Annotation Frequency </italic>
(
<italic>IAF </italic>
) measure is defined for a given corpus of annotated genes as the ratio between the total number of genes
<italic>G
<sub>Tot </sub>
</italic>
and the number of genes
<inline-formula>
<mml:math id="M6" name="1471-2105-11-588-i6" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
annotated by the term
<italic>t
<sub>i</sub>
</italic>
. The
<italic>IAF </italic>
value of term
<italic>t
<sub>i </sub>
</italic>
is calculated as</p>
<p>
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M7" name="1471-2105-11-588-i7" overflow="scroll">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>F</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>g</mml:mi>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>This definition is clearly related to what was defined above as the information content of a GO term in an annotation corpus. It can be verified that GO terms which are frequently used to annotate genes in a corpus will display a low
<italic>IAF </italic>
value, whereas GO terms that are rarely used will display a high
<italic>IAF </italic>
which reflects their specificity and their potentially high contribution to vector comparison. In summary, the coefficient
<italic>α
<sub>i </sub>
</italic>
is defined as</p>
<p>
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M8" name="1471-2105-11-588-i8" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>*</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>F</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
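<p>Equations (3) and (4) can be illustrated by the following Python sketch. The EC weights are those of List2 in Table 1; the toy corpus, the gene and term identifiers, and the function names are hypothetical. Three further assumptions are made: the natural logarithm is used, only direct annotations are counted, and when a gene carries the same term with several ECs the highest weight is retained.</p>
<preformat>
```python
import math

# EC weights taken from List2 of Table 1.
EC_WEIGHTS = {
    "TAS": 1.0, "NAS": 0.5,
    "EXP": 0.8, "IDA": 0.8, "IPI": 0.8, "IMP": 0.8, "IGI": 0.8, "IEP": 0.8,
    "ISS": 0.6, "RCA": 0.6, "ISA": 0.6, "ISO": 0.6, "ISM": 0.6, "IGC": 0.6,
    "IC": 0.5, "ND": 0.0, "IEA": 0.4,
}

# Hypothetical corpus: gene, then term, then list of ECs for that pair.
CORPUS = {
    "g1": {"t1": ["IEA"], "t2": ["IDA", "IEA"]},
    "g2": {"t2": ["TAS"]},
    "g3": {"t2": ["ISS"]},
    "g4": {"t3": ["ND"]},
}


def iaf(term, corpus):
    """Inverse Annotation Frequency, Equation (3): log(G_Tot / G_ti)."""
    g_tot = len(corpus)
    g_term = sum(1 for terms in corpus.values() if term in terms)
    if g_term == 0:
        raise ValueError("term does not annotate any gene in the corpus")
    return math.log(g_tot / g_term)


def coefficient(gene, term, corpus, weights=EC_WEIGHTS):
    """Coefficient of Equation (4): w(g, t_i) * IAF(t_i)."""
    codes = corpus[gene][term]
    # Assumption: keep the highest weight when several ECs are present.
    w = max(weights[code] for code in codes)
    return w * iaf(term, corpus)


print(coefficient("g1", "t1", CORPUS))
print(coefficient("g1", "t2", CORPUS))
```
</preformat>
<p>In this toy corpus the rare term t1 receives a higher IAF than the frequent term t2, but its coefficient for gene g1 is reduced by the low weight given to the IEA code, which shows how the two factors interact.</p>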
</sec>
<sec>
<title>2.1.2 The IntelliGO generalized cosine similarity measure</title>
<p>The second innovative feature of the
<italic>IntelliGO </italic>
VSM concerns the basis vectors themselves. In classical VSMs, the basis is orthonormal, i.e. the basis vectors are normalized and mutually orthogonal. This corresponds to the assumption that each dimension of the vector space (here each annotation term) is independent of the others. In the case of gene annotation, this assumption obviously conflicts with the fact that GO terms are interrelated in the GO rDAG structure. Therefore, in the
<italic>IntelliGO </italic>
VSM, basis vectors are not considered as orthogonal to each other within a given GO aspect (BP, MF, or CC).</p>
<p>A similar situation has been handled by Ganesan
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B14">14</xref>
] in the context of document retrieval using a tree hierarchy of indexing terms. Given two annotation terms,
<italic>t
<sub>i </sub>
</italic>
and
<italic>t
<sub>j</sub>
</italic>
, represented by their vectors,
<inline-formula>
<mml:math id="M9" name="1471-2105-11-588-i9" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="M10" name="1471-2105-11-588-i10" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
, respectively, the
<italic>Generalized Cosine-Similarity Measure </italic>
(GCSM) defines the dot product between these two basis vectors as</p>
<p>
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M11" name="1471-2105-11-588-i11" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
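<p>A minimal sketch of formula (5) for the simple case of a tree, where each term has a single parent and therefore a single depth and a single LCA, is given below. The toy hierarchy and all function names are hypothetical and serve only to illustrate the calculation.</p>
<preformat>
```python
# Hypothetical tree: each term maps to its unique parent (None for the root).
PARENT = {"root": None, "a": "root", "b": "a", "c": "a", "d": "root"}


def path_to_root(term):
    """List of terms from `term` up to the root, both included."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Number of edges between the root and `term`."""
    return len(path_to_root(term)) - 1


def lca(t_i, t_j):
    """Lowest common ancestor: first ancestor of t_j that is also one of t_i."""
    ancestors_i = set(path_to_root(t_i))
    for node in path_to_root(t_j):  # walks upward from t_j
        if node in ancestors_i:
            return node


def gcsm_dot(t_i, t_j):
    """Dot product of two basis vectors according to formula (5).

    Undefined when both terms are the root, since the denominator is then 0.
    """
    return 2 * depth(lca(t_i, t_j)) / (depth(t_i) + depth(t_j))
```
</preformat>
<p>In this toy tree, two sibling terms at depth 2 sharing a parent at depth 1 obtain a dot product of 0.5, a term compared with itself obtains 1, and two terms whose only common ancestor is the root obtain 0.</p>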
<p>The GCSM measure has been applied successfully by Blott
<italic>et al</italic>
. to a corpus of publications indexed using MeSH terms [
<xref ref-type="bibr" rid="B43">43</xref>
]. However, applying the GCSM to the GO rDAG is not trivial. As mentioned above, in an rDAG there may exist more than one path from a term to the
<italic>Root</italic>
. This has two consequences for the GCSM formula (5). Firstly, there may exist more than one LCA for two terms. Secondly, the depth value of a term is not unique but depends on the path which is followed up to the rDAG root. We therefore adapted the GCSM formula to rDAGs in a formal approach inspired by Couto
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B44">44</xref>
].</p>
<p>The GO controlled vocabulary can be defined as a triplet
<italic>γ </italic>
= (
<italic>T</italic>
, Ξ,
<italic>R</italic>
), where
<italic>T </italic>
is the set of annotation terms, Ξ is the set of the two main hierarchical relations that may hold between terms, i.e. Ξ = {
<italic>is-a, part-of </italic>
}. The third element
<italic>R </italic>
contains a set of triples τ = (
<italic>t</italic>
,
<italic>t</italic>
', ξ), where
<italic>t</italic>
,
<italic>t</italic>
' ∈
<italic>T </italic>
, ξ ∈ Ξ and
<italic>t</italic>
ξ
<italic>t</italic>
'. Note that ξ is an oriented child-parent relation and that ∀ τ ∈
<italic>R</italic>
, the relation ξ between
<italic>t </italic>
and
<italic>t</italic>
' is either
<italic>is-a </italic>
or
<italic>part-of</italic>
. In the
<italic>γ </italic>
vocabulary, the
<italic>Root </italic>
term represents the top-level node of the GO rDAG. Indeed,
<italic>Root </italic>
is the direct parent of three nodes,
<italic>BiologicalProcess</italic>
,
<italic>CellularComponent</italic>
, and
<italic>MolecularFunction</italic>
. These are also called aspect-specific roots. The
<italic>Root </italic>
node does not have any parents, and hence the collection
<italic>R </italic>
does not contain any triple in which
<italic>t </italic>
=
<italic>Root</italic>
. All GO terms in
<italic>T </italic>
are related to the root node through their aspect-specific root. Let
<italic>Parents </italic>
be a function that returns the set of direct parents of a given term
<italic>t</italic>
:</p>
<p>
<disp-formula id="bmcM6">
<label>(6)</label>
<mml:math id="M12" name="1471-2105-11-588-i12" overflow="scroll">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>T</mml:mi>
<mml:mtext></mml:mtext>
<mml:mo>→</mml:mo>
<mml:mi mathvariant="script">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mo>′</mml:mo>
</mml:msup>
<mml:mo>∈</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mo>∃</mml:mo>
<mml:mi>ξ</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>Ξ</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo>∃</mml:mo>
<mml:mi>τ</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>τ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>t</mml:mi>
<mml:mo>′</mml:mo>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>ξ</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>where
<inline-formula>
<mml:math id="M13" name="1471-2105-11-588-i48" overflow="scroll">
<mml:mi mathvariant="script">P</mml:mi>
</mml:math>
</inline-formula>
(
<italic>T </italic>
) refers to the set of all possible subsets of
<italic>T</italic>
. Note that
<italic>Parents</italic>
(
<italic>Root</italic>
) = ∅. The function
<italic>Parents </italic>
is used to define the
<italic>RootPath </italic>
function as the set of directed paths descending from the
<italic>Root </italic>
term to a given term
<italic>t</italic>
:</p>
<p>
<disp-formula id="bmcM7">
<label>(7)</label>
<mml:math id="M14" name="1471-2105-11-588-i13" overflow="scroll">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>:</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>T</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi mathvariant="script">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="script">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mo>{</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>}</mml:mo>
<mml:mo>}</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi>
<mml:mtext></mml:mtext>
<mml:mi>t</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>{</mml:mo>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>}</mml:mo>
<mml:mo>|</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∧</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∧</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>∧</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>}</mml:mo>
<mml:mo>∈</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mtext></mml:mtext>
<mml:mtext>otherwise</mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mtd>
<mml:mtd>
<mml:mo></mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>Thus, each path between the
<italic>Root </italic>
term and a term
<italic>t </italic>
is a set of terms Φ ∈
<italic>RootPath</italic>
(
<italic>t</italic>
).</p>
<p>The length of a path separating a term
<italic>t </italic>
from the
<italic>Root </italic>
term is defined as the number of edges connecting the nodes in the path, and is also called the
<italic>Depth </italic>
of term
<italic>t</italic>
. However, due to the multiplicity of paths in an rDAG, there can be more than one depth value associated with a term. In the following, and by way of demonstration, we define
<italic>Depth</italic>
(
<italic>t</italic>
) as the function associating a term
<italic>t </italic>
with its maximal depth:</p>
<p>
<disp-formula id="bmcM8">
<label>(8)</label>
<mml:math id="M15" name="1471-2105-11-588-i14" overflow="scroll">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mtext></mml:mtext>
<mml:mo>|</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Note that since
<italic>RootPath</italic>
(
<italic>Root</italic>
) = {{Root}}, we have
<italic>Depth</italic>
(
<italic>Root</italic>
) = |{Root}| -1 = 0.</p>
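<p>The following Python sketch illustrates the RootPath and Depth definitions on a small hypothetical rDAG. The graph and the function names are invented for the example, and paths are stored as ordered tuples rather than sets purely for readability.</p>
<preformat>
```python
# Hypothetical rDAG: each term maps to the set of its direct parents.
PARENTS = {
    "Root": set(),
    "BP": {"Root"},
    "x": {"BP"},
    "y": {"BP", "x"},  # two parents, hence two distinct root paths
}


def root_paths(term):
    """All paths from Root down to `term`, each as a tuple of terms."""
    if not PARENTS[term]:
        return [(term,)]
    return [
        path + (term,)
        for parent in sorted(PARENTS[term])
        for path in root_paths(parent)
    ]


def depth(term):
    """Maximal depth of `term`: longest root path, counted in edges."""
    return max(len(path) for path in root_paths(term)) - 1
```
</preformat>
<p>Here the term y is reached by the paths (Root, BP, y) and (Root, BP, x, y), so its maximal depth is 3, whereas the depth of Root is 0.</p>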
<p>We then define the
<italic>Ancestors </italic>
function to identify an ancestor term of a given term
<italic>t </italic>
as any element
<italic>α </italic>
of a path Φ ∈
<italic>RootPath</italic>
(
<italic>t</italic>
).</p>
<p>
<disp-formula>
<mml:math id="M16" name="1471-2105-11-588-i15" overflow="scroll">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo>:</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>T</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi mathvariant="script">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>and</p>
<p>
<disp-formula id="bmcM9">
<label>(9)</label>
<mml:math id="M17" name="1471-2105-11-588-i16" overflow="scroll">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>|</mml:mo>
<mml:mo>∃</mml:mo>
<mml:mi>Φ</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>Φ</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∧</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>Φ</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Thus, the common ancestors of two terms
<italic>t
<sub>a </sub>
</italic>
and
<italic>t
<sub>b </sub>
</italic>
can be defined as:</p>
<p>
<disp-formula id="bmcM10">
<label>(10)</label>
<mml:math id="M18" name="1471-2105-11-588-i17" overflow="scroll">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∩</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Let
<italic>LCAset</italic>
(
<italic>t
<sub>a</sub>
</italic>
,
<italic>t
<sub>b</sub>
</italic>
) be the set of lowest common ancestors of terms
<italic>t
<sub>a</sub>
</italic>
,
<italic>t
<sub>b</sub>
</italic>
. The lowest common ancestors are the common ancestors at the maximal distance from the root node. In other words, their depth is the maximum depth over all terms
<italic>α </italic>
∈
<italic>CommonAnc</italic>
(
<italic>t
<sub>a</sub>
</italic>
,
<italic>t
<sub>b</sub>
</italic>
). Note that this value is unique but it may correspond to more than one
<italic>LCA </italic>
term:</p>
<p>
<disp-formula>
<mml:math id="M19" name="1471-2105-11-588-i18" overflow="scroll">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>:</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>T</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>→</mml:mo>
<mml:mi mathvariant="script">P</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula id="bmcM11">
<label>(11)</label>
<mml:math id="M20" name="1471-2105-11-588-i19" overflow="scroll">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo>|</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
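<p>The Ancestors, CommonAnc and LCAset definitions can be sketched in Python as follows. The small rDAG and the function names are hypothetical; the graph is chosen so that two terms share two lowest common ancestors of equal depth.</p>
<preformat>
```python
# Hypothetical rDAG: each term maps to the set of its direct parents.
PARENTS = {
    "Root": set(),
    "A": {"Root"},
    "B": {"Root"},
    "t1": {"A", "B"},
    "t2": {"A", "B"},
}


def ancestors(term):
    """Terms lying on some root path of `term` (the term itself included)."""
    found = {term}
    for parent in PARENTS[term]:
        found |= ancestors(parent)
    return found


def depth(term):
    """Maximal depth: number of edges on the longest path from Root."""
    if not PARENTS[term]:
        return 0
    return 1 + max(depth(parent) for parent in PARENTS[term])


def common_anc(t_a, t_b):
    """Common ancestors, i.e. the intersection of the two ancestor sets."""
    return ancestors(t_a).intersection(ancestors(t_b))


def lca_set(t_a, t_b):
    """Common ancestors whose depth equals the maximum common-ancestor depth."""
    common = common_anc(t_a, t_b)
    deepest = max(depth(a) for a in common)
    return {a for a in common if depth(a) == deepest}
```
</preformat>
<p>For t1 and t2 the common ancestors are Root, A and B; the maximum depth among them is 1 and is reached by both A and B, so the LCA set contains two terms.</p>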
<p>Having defined the
<italic>LCAset</italic>
, it is possible to define a subset of paths from the
<italic>Root </italic>
term to a given term
<italic>t </italic>
that pass through one of the
<italic>LCA </italic>
terms and subsequently ascend to the root node using the longest path between the
<italic>LCA </italic>
and the
<italic>Root </italic>
term. This notion is called
<italic>ConstrainedRootPath</italic>
, and can be calculated for any pair (t, s) with
<italic>s </italic>
∈
<italic>Ancestors</italic>
(
<italic>t</italic>
):</p>
<p>
<disp-formula id="bmcM12">
<label>(12)</label>
<mml:math id="M21" name="1471-2105-11-588-i20" overflow="scroll">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>|</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∧</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>∀</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>∈</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>∧</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>⊂</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>⇒</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>}</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>This leads to a precise definition of the path length
<italic>PL
<sub>k</sub>
</italic>
(
<italic>t</italic>
,
<italic>s</italic>
), for
<italic>s </italic>
∈
<italic>Ancestors</italic>
(
<italic>t</italic>
) and for a given path Φ
<italic>
<sub>k </sub>
</italic>
∈
<italic>ConstrainedRootPath</italic>
(
<italic>t</italic>
,
<italic>s</italic>
) as:</p>
<p>
<disp-formula id="bmcM13">
<label>(13)</label>
<mml:math id="M22" name="1471-2105-11-588-i21" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>Φ</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>For a given
<italic>LCA </italic>
∈
<italic>LCAset</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>t
<sub>j</sub>
</italic>
), we can now define the shortest path length (SPL) between two terms
<italic>t
<sub>i </sub>
</italic>
and
<italic>t
<sub>j </sub>
</italic>
passing through this lowest common ancestor as</p>
<p>
<disp-formula id="bmcM14">
<label>(14)</label>
<mml:math id="M23" name="1471-2105-11-588-i22" overflow="scroll">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>P</mml:mi>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>P</mml:mi>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>The minimal SPL between terms
<italic>t
<sub>i </sub>
</italic>
and
<italic>t
<sub>j </sub>
</italic>
considering all their possible
<italic>LCA</italic>
s is thus given by</p>
<p>
<disp-formula id="bmcM15">
<label>(15)</label>
<mml:math id="M24" name="1471-2105-11-588-i23" overflow="scroll">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>l</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>l</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mtext></mml:mtext>
<mml:mo>|</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>l</mml:mi>
</mml:msub>
<mml:mtext></mml:mtext>
<mml:mo></mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
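<p>As a hedged illustration of definitions (12) to (15), the sketch below computes constrained root paths, path lengths, SPL and MinSPL on a small hypothetical rDAG. All names are invented for the example, and paths are represented as ordered tuples so that the portion of a path leading to the ancestor can be read off by position.</p>
<preformat>
```python
# Hypothetical rDAG: each term maps to the set of its direct parents.
PARENTS = {
    "Root": set(),
    "A": {"Root"},
    "B": {"Root"},
    "C": {"A"},
    "t1": {"A", "B"},
    "t2": {"B", "C"},
}


def root_paths(term):
    """All paths from Root down to `term`, each as a tuple of terms."""
    if not PARENTS[term]:
        return [(term,)]
    return [p + (term,) for parent in sorted(PARENTS[term]) for p in root_paths(parent)]


def depth(term):
    """Maximal depth of `term`, in edges."""
    return max(len(p) for p in root_paths(term)) - 1


def ancestors(term):
    """Every term found on at least one root path of `term`."""
    return {node for p in root_paths(term) for node in p}


def lca_set(t_a, t_b):
    """Deepest common ancestors of the two terms."""
    common = ancestors(t_a).intersection(ancestors(t_b))
    deepest = max(depth(a) for a in common)
    return {a for a in common if depth(a) == deepest}


def constrained_root_paths(term, s):
    """Root paths of `term` through `s` that reach `s` by one of its longest paths."""
    return [p for p in root_paths(term) if s in p and p.index(s) == depth(s)]


def path_length(path, s):
    """PL: number of edges between `s` and the end of a constrained path."""
    return len(path) - 1 - depth(s)


def spl(t_i, t_j, lca):
    """Shortest path length between t_i and t_j through a given LCA."""
    pl_i = min(path_length(p, lca) for p in constrained_root_paths(t_i, lca))
    pl_j = min(path_length(p, lca) for p in constrained_root_paths(t_j, lca))
    return pl_i + pl_j


def min_spl(t_i, t_j):
    """Minimal SPL over all lowest common ancestors."""
    return min(spl(t_i, t_j, lca) for lca in lca_set(t_i, t_j))
```
</preformat>
<p>In this example t1 and t2 have two lowest common ancestors, A and B. The SPL through A is 3 (one edge from t1, two edges from t2 via C), the SPL through B is 2, and MinSPL is therefore 2.</p>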
<p>Returning to the GCSM formula (5), we now relate
<italic>Depth</italic>
(
<italic>t
<sub>i</sub>
</italic>
)+
<italic>Depth</italic>
(
<italic>t
<sub>j</sub>
</italic>
) in the denominator of the expression to
<italic>MinSPL</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>t
<sub>j</sub>
</italic>
) and
<italic>Depth</italic>
(
<italic>LCA</italic>
). Note that from (8) we have</p>
<p>
<italic>Depth</italic>
(
<italic>t
<sub>i</sub>
</italic>
) =
<italic>Max
<sub>k</sub>
</italic>
(|Φ
<italic>
<sub>k</sub>
</italic>
| - 1), with Φ
<italic>
<sub>k </sub>
</italic>
∈
<italic>RootPath</italic>
(
<italic>t
<sub>i</sub>
</italic>
). From (13) we have</p>
<p>
<italic>PL
<sub>k</sub>
</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>LCA</italic>
) = |Φ
<italic>
<sub>k</sub>
</italic>
| - 1 -
<italic>Depth</italic>
(
<italic>LCA</italic>
) with Φ
<italic>
<sub>k </sub>
</italic>
∈
<italic>ConstrainedRootPath</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>LCA</italic>
) and
<italic>ConstrainedRootPath</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>LCA</italic>
) ⊂
<italic>RootPath</italic>
(
<italic>t
<sub>i</sub>
</italic>
). Given any
<italic>LCA </italic>
∈
<italic>LCAset</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>t
<sub>j</sub>
</italic>
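), every path in the corresponding
<italic>ConstrainedRootPath</italic>
is also a root path of
<italic>t
<sub>i</sub>
</italic>
, so its length in edges cannot exceed
<italic>Depth</italic>
(
<italic>t
<sub>i</sub>
</italic>
). It is then easy to demonstrate that</p>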
<p>
<disp-formula id="bmcM16">
<label>(16)</label>
<mml:math id="M25" name="1471-2105-11-588-i24" overflow="scroll">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≥</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>P</mml:mi>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Similarly,</p>
<p>
<disp-formula id="bmcM17">
<label>(17)</label>
<mml:math id="M26" name="1471-2105-11-588-i25" overflow="scroll">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≥</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>P</mml:mi>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>h</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Thus,</p>
<p>
<disp-formula id="bmcM18">
<label>(18)</label>
<mml:math id="M27" name="1471-2105-11-588-i26" overflow="scroll">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mtext></mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≥</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≥</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtext></mml:mtext>
<mml:mtext></mml:mtext>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
</p>
<p>In the case of a tree, this inequality becomes an equality.</p>
<p>The semantic similarity between two terms is assumed to be inversely proportional to the length of the path separating the two terms across their
<italic>LCA</italic>
. When we adapt the GCSM measure in (5) by replacing in the denominator the sum
<italic>Depth</italic>
(
<italic>t
<sub>i</sub>
</italic>
) +
<italic>Depth</italic>
(
<italic>t
<sub>j</sub>
</italic>
) by the smaller sum
<italic>MinSPL</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>t
<sub>j</sub>
</italic>
) + 2 *
<italic>Depth</italic>
(
<italic>LCA</italic>
), we ensure that the dot product between two base vectors will be maximized. With this adaptation, the
<italic>IntelliGO </italic>
dot product between two base vectors corresponding to two GO terms
<italic>t
<sub>i </sub>
</italic>
and
<italic>t
<sub>j </sub>
</italic>
is defined as</p>
<p>
<disp-formula id="bmcM19">
<label>(19)</label>
<mml:math id="M28" name="1471-2105-11-588-i27" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>One can verify that with this definition, the dot product takes values in the interval [0, 1]. We observe that for
<inline-formula>
<mml:math id="M29" name="1471-2105-11-588-i28" overflow="scroll">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
, since
<italic>MinSPL</italic>
(
<italic>t
<sub>i</sub>
</italic>
,
<italic>t
<sub>j</sub>
</italic>
) = 0. Moreover, when two terms are only related through the root of the rDAG, we have
<inline-formula>
<mml:math id="M30" name="1471-2105-11-588-i29" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
because
<italic>Depth</italic>
(
<italic>Root</italic>
) = 0. In any other case, the value of the dot product represents a non-zero
<italic>edge-based </italic>
similarity between terms. Note that this value clearly depends on the rDAG structure of the GO graph.</p>
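<p>The dot product of equation (19) can be illustrated with a minimal sketch on a toy rDAG. The graph, the term names and the helper functions (PARENTS, depth, up_distances, base_dot) are hypothetical and are not part of the IntelliGO implementation. The sketch assumes that Depth is the length of the longest path from the root, that the LCA is the deepest common ancestor, and that MinSPL is the shortest path between the two terms passing through any common ancestor.</p>
<preformat><![CDATA[
```python
from functools import lru_cache

# Hypothetical toy rDAG (not real GO terms): each term maps to its parents.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
    "e": ["c", "d"],
}


@lru_cache(maxsize=None)
def depth(term):
    """Length of the longest path from the root down to term."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def up_distances(term):
    """Shortest upward path length from term to each ancestor (term included)."""
    dist = {term: 0}
    frontier = [term]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in PARENTS[node]:
                if parent not in dist:
                    dist[parent] = dist[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def base_dot(ti, tj):
    """Dot product of two base vectors following equation (19)."""
    if ti == tj:
        return 1.0  # MinSPL is 0; also avoids 0/0 when both terms are the root
    di, dj = up_distances(ti), up_distances(tj)
    common = [a for a in di if a in dj]           # common ancestors
    lca_depth = max(depth(a) for a in common)     # Depth(LCA)
    min_spl = min(di[a] + dj[a] for a in common)  # MinSPL(ti, tj)
    return 2 * lca_depth / (min_spl + 2 * lca_depth)
```
]]></preformat>
<p>On this toy graph, terms c and d share the ancestor a at depth 1 and are separated by a path of length 2, so their dot product is 2/(2 + 2) = 0.5, whereas a and b are related only through the root and their dot product is 0.</p>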
</sec>
</sec>
<sec>
<title>2.2 The IntelliGO semantic similarity measure</title>
<p>In summary, the
<italic>IntelliGO </italic>
semantic similarity measure between two genes
<italic>g </italic>
and
<italic>h </italic>
represented by their vectors
<inline-formula>
<mml:math id="M31" name="1471-2105-11-588-i30" overflow="scroll">
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="M32" name="1471-2105-11-588-i31" overflow="scroll">
<mml:mover accent="true">
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
</mml:math>
</inline-formula>
, respectively, is given by the following cosine formula:</p>
<p>
<disp-formula id="bmcM20">
<label>(20)</label>
<mml:math id="M33" name="1471-2105-11-588-i32" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>G</mml:mi>
<mml:mi>O</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>*</mml:mo>
<mml:mover accent="true">
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>*</mml:mo>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
</mml:mrow>
</mml:msqrt>
<mml:msqrt>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>*</mml:mo>
<mml:mover accent="true">
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where:</p>
<p>
<inline-formula>
<mml:math id="M34" name="1471-2105-11-588-i33" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</inline-formula>
: the vectorial representation of the gene
<italic>g </italic>
in the
<italic>IntelliGO </italic>
VSM.</p>
<p>
<inline-formula>
<mml:math id="M35" name="1471-2105-11-588-i34" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>β</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>*</mml:mo>
</mml:mrow>
</mml:mstyle>
<mml:mtext></mml:mtext>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
: the vectorial representation of the gene
<italic>h </italic>
in the
<italic>IntelliGO </italic>
VSM.</p>
<p>
<italic>α
<sub>i </sub>
</italic>
=
<italic>w</italic>
(
<italic>g, t
<sub>i</sub>
</italic>
)*
<italic>IAF</italic>
(
<italic>t
<sub>i</sub>
</italic>
) the coefficient of term
<italic>t
<sub>i </sub>
</italic>
for gene
<italic>g</italic>
, where w(
<italic>g, t
<sub>i</sub>
</italic>
) represents the weight assigned to the evidence code between
<italic>t
<sub>i </sub>
</italic>
and
<italic>g</italic>
, and
<italic>IAF</italic>
(
<italic>t
<sub>i</sub>
</italic>
) is the inverse annotation frequency of the term
<italic>t
<sub>i</sub>
</italic>
.</p>
<p>
<italic>β
<sub>j </sub>
</italic>
=
<italic>w</italic>
(
<italic>h, t
<sub>j</sub>
</italic>
)*
<italic>IAF</italic>
(
<italic>t
<sub>j</sub>
</italic>
) the coefficient of term
<italic>t
<sub>j </sub>
</italic>
for gene
<italic>h</italic>
.</p>
<p>
<inline-formula>
<mml:math id="M36" name="1471-2105-11-588-i35" overflow="scroll">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>*</mml:mo>
<mml:mover accent="true">
<mml:mi>h</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mi>β</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
</mml:mrow>
</mml:mstyle>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
: the dot product between the two gene vectors.</p>
<p>
<inline-formula>
<mml:math id="M37" name="1471-2105-11-588-i36" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>*</mml:mo>
<mml:mi>D</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
: the dot product between
<inline-formula>
<mml:math id="M38" name="1471-2105-11-588-i37" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
and
<inline-formula>
<mml:math id="M39" name="1471-2105-11-588-i38" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo></mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>e</mml:mi>
<mml:mo></mml:mo>
</mml:mover>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>≠</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>
if the corresponding terms
<italic>t
<sub>i </sub>
</italic>
and
<italic>t
<sub>j </sub>
</italic>
share common ancestors other than the rDAG root).</p>
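<p>The definitions above can be sketched in a few lines of code. This is only an illustration under stated assumptions: gene vectors are represented as dictionaries mapping a term to its coefficient, and the term names, the TOY_DOT table and the function names (toy_base_dot, gene_dot, sim_intelligo) are hypothetical stand-ins for pre-calculated base-vector dot products.</p>
<preformat><![CDATA[
```python
import math

# Hypothetical pre-calculated dot products between base vectors of distinct terms.
TOY_DOT = {("t1", "t2"): 0.5, ("t1", "t3"): 0.0, ("t2", "t3"): 0.8}


def toy_base_dot(ti, tj):
    """Look up e_i * e_j; a term compared with itself gives 1."""
    if ti == tj:
        return 1.0
    return TOY_DOT.get((ti, tj), TOY_DOT.get((tj, ti), 0.0))


def gene_dot(g, h, base_dot):
    """Sum over all term pairs (i, j) of alpha_i * beta_j * (e_i * e_j)."""
    return sum(a * b * base_dot(ti, tj)
               for ti, a in g.items()
               for tj, b in h.items())


def sim_intelligo(g, h, base_dot):
    """Generalized cosine of equation (20); returns 0.0 for an empty vector."""
    norm = math.sqrt(gene_dot(g, g, base_dot)) * math.sqrt(gene_dot(h, h, base_dot))
    return gene_dot(g, h, base_dot) / norm if norm else 0.0
```
]]></preformat>
<p>For example, with g = {"t1": 1.0} and h = {"t2": 2.0}, the numerator is 1.0 * 2.0 * 0.5 = 1.0 and the denominator is 1.0 * 2.0, giving a similarity of 0.5 even though the two genes share no annotation term. With an orthogonal base (dot product 0 for any two distinct terms) the same pair would score 0.</p>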
</sec>
<sec>
<title>2.3 The IntelliGO Algorithm</title>
<p>The
<italic>IntelliGO </italic>
algorithm was designed to calculate the similarity measure between two genes, taking as input their identifiers in the NCBI GENE database, and as parameters a GO aspect (BP, MF, CC), a particular species, and a list of weights associated with GO ECs. The output is the
<italic>IntelliGO </italic>
similarity value between the two genes. In order to calculate this efficiently, we first extract from the NCBI annotation file [
<xref ref-type="bibr" rid="B45">45</xref>
] the list of all non-redundant GO terms and the list of associated genes they annotate, regardless of their evidence codes. The
<italic>IAF </italic>
values are then calculated and stored in the
<italic>SpeciesIAF </italic>
file. We then construct all possible pairs of GO terms and query the AmiGO database [
<xref ref-type="bibr" rid="B46">46</xref>
] to recover their
<italic>LCA</italic>
,
<italic>Depth</italic>
(
<italic>LCA</italic>
) and
<italic>SPL </italic>
values. Each dot product between two vectors representing two GO terms can thus be pre-calculated and stored in the
<italic>DotProduct </italic>
file.</p>
<p>The first step of the
<italic>IntelliGO </italic>
algorithm consists of filtering the NCBI file with the user's parameters (GO aspect, species and list of weights assigned to ECs) to produce a
<italic>CuratedAnnotation </italic>
file from which all annotations for species and GO aspects other than those selected are removed. If a gene is annotated several times by the same GO term with different ECs, the program retains the EC with the greatest weight in the list of EC weights given as a parameter. Then, for two input NCBI gene identifiers, the
<italic>IntelliGO </italic>
function (i) retrieves from the
<italic>CuratedAnnotation </italic>
file the list of GO terms annotating the two genes and their associated ECs, (ii) calculates from the
<italic>SpeciesIAF </italic>
file and the list of EC weights, all the coefficients of the two gene representations in the
<italic>IntelliGO </italic>
VSM, (iii) constructs the pairs of terms required to calculate the similarity value between the two vectors, (iv) assigns from the
<italic>DotProduct </italic>
file the corresponding value to each dot product, and (v) finally calculates the
<italic>IntelliGO </italic>
similarity value according to (20).</p>
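<p>The filtering rule and the coefficient calculation described above can be sketched as follows. The data layout (a list of gene, term, EC triples), the function names (curate, coefficients) and all weights and IAF values are hypothetical and chosen only for illustration; they do not reproduce the format of the <italic>CuratedAnnotation</italic> or <italic>SpeciesIAF</italic> files.</p>
<preformat><![CDATA[
```python
def curate(annotations, ec_weights):
    """For each (gene, term) pair, keep the evidence code with the greatest weight."""
    best = {}
    for gene, term, ec in annotations:
        key = (gene, term)
        if key not in best or ec_weights[ec] > ec_weights[best[key]]:
            best[key] = ec
    return best


def coefficients(gene, curated, ec_weights, iaf):
    """Coefficient of each term for a gene: w(gene, term) * IAF(term)."""
    return {term: ec_weights[ec] * iaf[term]
            for (g, term), ec in curated.items()
            if g == gene}


# Hypothetical inputs: identifiers, weights and IAF values are made up.
annotations = [
    ("g1", "GO:A", "IEA"),
    ("g1", "GO:A", "IDA"),  # same term, different EC: the heavier EC is kept
    ("g1", "GO:B", "IEA"),
]
ec_weights = {"IDA": 1.0, "IEA": 0.5}
iaf = {"GO:A": 2.0, "GO:B": 3.0}

curated = curate(annotations, ec_weights)
alpha = coefficients("g1", curated, ec_weights, iaf)
```
]]></preformat>
<p>Here the pair (g1, GO:A) retains the IDA evidence code because its weight is greater, so the resulting coefficients are 1.0 * 2.0 = 2.0 for GO:A and 0.5 * 3.0 = 1.5 for GO:B.</p>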
</sec>
<sec>
<title>2.4 Testing the IntelliGO semantic similarity measure</title>
<sec>
<title>2.4.1 Benchmarking datasets and testing protocol</title>
<p>We evaluated our method using two different benchmarks depending on the GO aspect. For the KEGG benchmark, we selected a representative set of 13 yeast and 13 human diverse KEGG pathways [
<xref ref-type="bibr" rid="B47">47</xref>
] which contain a reasonable number of genes (between 10 and 30). The selected pathways are listed in Table
<xref ref-type="table" rid="T2">2</xref>
. The genes in these pathways were retrieved from KEGG using the
<italic>DBGET </italic>
database retrieval system [
<xref ref-type="bibr" rid="B48">48</xref>
]. Assuming that genes which belong to the same pathway are often related to a similar biological process, the similarity values calculated for this dataset should be related to the BP GO aspect.</p>
<table-wrap id="T2" position="float">
<label>Table 2</label>
<caption>
<p>List of yeast and human pathways used in this study.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">KEGG</th>
<th align="center">KEGG</th>
<th align="center">Yeast</th>
<th align="center">Name</th>
<th align="center">Nb genes</th>
<th align="center">Human</th>
<th align="center">Name</th>
<th align="center">Nb genes</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">
<bold>Category</bold>
</td>
<td align="center">
<bold>Subcategory</bold>
</td>
<td align="center">
<bold>Pathway</bold>
</td>
<td></td>
<td></td>
<td align="center">
<bold>Pathway</bold>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">01100 Metabolism</td>
<td align="center">01101 Carbohydrate Metabolism</td>
<td align="center">sce00562</td>
<td align="center">Inositol phosphate metabolism</td>
<td align="center">15</td>
<td align="center">hsa00040</td>
<td align="center">Pentose and glucuronate interconversions</td>
<td align="center">26</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01102 Energy Metabolism</td>
<td align="center">sce00920</td>
<td align="center">Sulfur metabolism</td>
<td align="center">13</td>
<td align="center">hsa00920</td>
<td align="center">Sulfur metabolism</td>
<td align="center">13</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01103 Lipid Metabolism</td>
<td align="center">sce00600</td>
<td align="center">Sphingolipid metabolism</td>
<td align="center">13</td>
<td align="center">hsa00140</td>
<td align="center">C21-Steroid hormone metabolism</td>
<td align="center">17</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01105 Amino Acid Metabolism</td>
<td align="center">sce00300</td>
<td align="center">Lysine biosynthesis</td>
<td align="center">13</td>
<td align="center">hsa00290</td>
<td align="center">Valine, leucine and isoleucine biosynthesis</td>
<td align="center">11</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td align="center">sce00410</td>
<td align="center">Alanine biosynthesis</td>
<td align="center">8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01107 Glycan Biosynthesis and Metabolism</td>
<td align="center">sce00514</td>
<td align="center">O-Mannosyl glycan biosynthesis</td>
<td align="center">13</td>
<td align="center">hsa00563</td>
<td align="center">Glycosylphosphatidylinositol (GPI)-anchor biosynthesis</td>
<td align="center">23</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01109 Metabolism of Cofactors and Vitamins</td>
<td align="center">sce00670</td>
<td align="center">One carbon pool by folate</td>
<td align="center">14</td>
<td align="center">hsa00670</td>
<td align="center">One carbon pool by folate</td>
<td align="center">16</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01110 Biosynthesis of Secondary Metabolites</td>
<td align="center">sce00903</td>
<td align="center">Limonene and pinene degradation</td>
<td align="center">7</td>
<td align="center">hsa00232</td>
<td align="center">Caffeine metabolism</td>
<td align="center">7</td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">01120 Genetic Information Processing</td>
<td align="center">01121 Transcription</td>
<td align="center">sce03022</td>
<td align="center">Basal transcription factors</td>
<td align="center">24</td>
<td align="center">hsa03022</td>
<td align="center">Basal transcription factors</td>
<td align="center">38</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td align="center">hsa03020</td>
<td align="center">RNA polymerase</td>
<td align="center">29</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01123 Folding, Sorting and Degradation</td>
<td align="center">sce04130</td>
<td align="center">SNARE interactions in vesicular transport</td>
<td align="center">23</td>
<td align="center">hsa04130</td>
<td align="center">SNARE interactions in vesicular transport</td>
<td align="center">38</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">01124 Replication and Repair</td>
<td align="center">sce03450</td>
<td align="center">Non-homologous end-joining</td>
<td align="center">10</td>
<td align="center">hsa03450</td>
<td align="center">Non-homologous end-joining</td>
<td align="center">14</td>
</tr>
<tr>
<td></td>
<td colspan="7">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td align="center">hsa03430</td>
<td align="center">Mismatch repair</td>
<td align="center">23</td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">01130 Environmental Information Processing</td>
<td align="center">01132 Signal Transduction</td>
<td align="center">sce04070</td>
<td align="center">Phosphatidylinositol signaling system</td>
<td align="center">15</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">01140 Cellular Processes</td>
<td align="center">01151 Transport and Catabolism</td>
<td align="center">sce04140</td>
<td align="center">Regulation of autophagy</td>
<td align="center">17</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">01160 Human Diseases</td>
<td align="center">01164 Metabolic Disorders</td>
<td></td>
<td></td>
<td></td>
<td align="center">hsa04950</td>
<td align="center">Maturity onset diabetes of the young</td>
<td align="center">25</td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Total genes number</td>
<td></td>
<td></td>
<td></td>
<td align="center">185</td>
<td></td>
<td></td>
<td align="center">280</td>
</tr>
<tr>
<td colspan="8">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Non-IEA:IEA ratio</td>
<td></td>
<td></td>
<td></td>
<td align="center">572:435 (1.3)</td>
<td></td>
<td></td>
<td align="center">560:620 (0.9)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Table 2: The KEGG categories and subcategories are indicated for each pathway as well as its name and the number of genes it contains (KEGG version Dec 2009). The non-IEA:IEA ratio refers to
<italic>Biological Process </italic>
GO annotation of the complete set of genes for each species.</p>
</table-wrap-foot>
</table-wrap>
<p>For the Pfam benchmark, we selected a set of clans (groups of highly related Pfam entries) from the Sanger Pfam database [
<xref ref-type="bibr" rid="B49">49</xref>
]. In order to maximize diversity in the benchmarking dataset, yeast and human sequences were retrieved from the 10 different Pfam clans listed in Table
<xref ref-type="table" rid="T3">3</xref>
. For each selected Pfam clan, we used all the associated Pfam entry identifiers to query the Uniprot database [
<xref ref-type="bibr" rid="B50">50</xref>
] and retrieve the corresponding human and yeast gene identifiers. Assuming that genes which share common domains in a Pfam clan often have a similar molecular function, the similarity values calculated for this second dataset should be related to the MF GO aspect.</p>
<table-wrap id="T3" position="float">
<label>Table 3</label>
<caption>
<p>List of yeast and human genes and Pfam clans used in this study.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center">Pfam clan accession (yeast)</th>
<th align="center">Nb genes</th>
<th align="center">Pfam clan name</th>
<th align="center">Pfam clan accession (human)</th>
<th align="center">Nb genes</th>
<th align="center">Pfam clan name</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">CL0328.1</td>
<td align="center">15</td>
<td align="center">2heme_cytochrom</td>
<td align="center">CL0099.10</td>
<td align="center">18</td>
<td align="center">ALDH-like</td>
</tr>
<tr>
<td align="center">CL0059.12</td>
<td align="center">13</td>
<td align="center">6_Hairpin</td>
<td align="center">CL0106.10</td>
<td align="center">8</td>
<td align="center">6PGD_C</td>
</tr>
<tr>
<td align="center">CL0092.9</td>
<td align="center">8</td>
<td align="center">ADF</td>
<td align="center">CL0417.1</td>
<td align="center">9</td>
<td align="center">BIR-like</td>
</tr>
<tr>
<td align="center">CL0099.10</td>
<td align="center">11</td>
<td align="center">ALDH-like</td>
<td align="center">CL0165.8</td>
<td align="center">5</td>
<td align="center">Cache</td>
</tr>
<tr>
<td align="center">CL0179.11</td>
<td align="center">11</td>
<td align="center">ATP-grasp</td>
<td align="center">CL0149.9</td>
<td align="center">7</td>
<td align="center">CoA-acyltrans</td>
</tr>
<tr>
<td align="center">CL0255.6</td>
<td align="center">7</td>
<td align="center">ATP_synthase</td>
<td align="center">CL0085.11</td>
<td align="center">12</td>
<td align="center">FAD_DHS</td>
</tr>
<tr>
<td align="center">CL0378.1</td>
<td align="center">10</td>
<td align="center">Ac-CoA-synth</td>
<td align="center">CL0076.9</td>
<td align="center">18</td>
<td align="center">FAD_Lum_binding</td>
</tr>
<tr>
<td align="center">CL0257.6</td>
<td align="center">18</td>
<td align="center">Acetyltrans-like</td>
<td align="center">CL0289.3</td>
<td align="center">6</td>
<td align="center">FBD</td>
</tr>
<tr>
<td align="center">CL0034.12</td>
<td align="center">11</td>
<td align="center">Amidohydrolase</td>
<td align="center">CL0119.10</td>
<td align="center">7</td>
<td align="center">Flavokinase</td>
</tr>
<tr>
<td align="center">CL0135.8</td>
<td align="center">14</td>
<td align="center">Arrestin_N-like</td>
<td align="center">CL0042.9</td>
<td align="center">10</td>
<td align="center">Flavoprotein</td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Total genes number</td>
<td align="center">118</td>
<td></td>
<td></td>
<td align="center">100</td>
<td></td>
</tr>
<tr>
<td colspan="6">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Non-IEA:IEA ratio</td>
<td align="center">121:366 (0.3)</td>
<td></td>
<td></td>
<td align="center">144:309 (0.46)</td>
<td></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Table 3: Clans are indicated by their accession identifier in the Sanger Pfam database (October 2009 release) and by the number of genes retrieved either in yeast (left part) or in human (right part). Each clan contains several Pfam entries listed in the Pfam_C file at [
<xref ref-type="bibr" rid="B57">57</xref>
]. The non-IEA:IEA ratio refers to the
<italic>Molecular Function </italic>
GO annotation of the complete set of genes for each species.</p>
</table-wrap-foot>
</table-wrap>
<p>For each set of genes, an
<italic>intra-set average gene similarity </italic>
was calculated as the average of all pairwise similarity values within a set of genes. In contrast, an
<italic>inter-set average gene similarity </italic>
was also calculated between two sets
<italic>S
<sub>a </sub>
</italic>
and
<italic>S
<sub>b </sub>
</italic>
as the average of all similarity values calculated for pairs of genes from each of the two sets
<italic>S
<sub>a </sub>
</italic>
and
<italic>S
<sub>b</sub>
</italic>
. A discriminating power can then be defined according to the ratio of the intra-set and inter-set average gene similarities (see Methods). We compared the values obtained with the
<italic>IntelliGO </italic>
similarity measure with the four other representative similarity measures described above, namely the Lord measure which is based on the Resnik term-term similarity, the Al-Mubaid measure which considers only the path length between GO terms [
<xref ref-type="bibr" rid="B31">31</xref>
], a standard vector-based cosine similarity measure, and the
<italic>SimGIC </italic>
measure which is one of the
<italic>graph-based </italic>
methods described above (see Section 1.2.3). For each dataset, we evaluated our measure firstly by comparing the intra-set similarity values with those obtained with other measures, and then by studying the effect of varying the list of weights assigned to the ECs. We then compared the discriminating power of the
<italic>IntelliGO </italic>
similarity measure with three other measures. We also tested our measure on a reference dataset using a recently available on-line evaluation tool.</p>
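<p>The two averages defined above can be sketched as follows. This is a simplified illustration: it assumes that the intra-set average runs over distinct unordered pairs of genes (self-comparisons excluded), and it uses a plain intra/inter ratio as a stand-in for the discriminating power, whose exact definition is given in Methods. The function names and the toy similarity function are hypothetical.</p>
<preformat><![CDATA[
```python
from itertools import combinations, product


def intra_set_similarity(genes, sim):
    """Average similarity over all distinct pairs within one set (needs 2+ genes)."""
    pairs = list(combinations(genes, 2))
    return sum(sim(g, h) for g, h in pairs) / len(pairs)


def inter_set_similarity(set_a, set_b, sim):
    """Average similarity over all pairs with one gene taken from each set."""
    pairs = list(product(set_a, set_b))
    return sum(sim(g, h) for g, h in pairs) / len(pairs)


def toy_sim(g, h):
    """Hypothetical similarity: high when both names start with the same letter."""
    return 0.75 if g[0] == h[0] else 0.25


set_a = ["a1", "a2", "a3"]
set_b = ["b1", "b2"]
intra = intra_set_similarity(set_a, toy_sim)
inter = inter_set_similarity(set_a, set_b, toy_sim)
ratio = intra / inter  # simplified stand-in for the discriminating power
```
]]></preformat>
<p>A ratio well above 1 indicates that genes are on average more similar within their own set than across sets, which is the behaviour expected from a measure that discriminates between distinct pathways or clans.</p>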
</sec>
<sec>
<title>2.4.2 Intra-set similarity</title>
<p>We produced all intra-set similarities with the
<italic>IntelliGO </italic>
similarity measure using EC
<italic>List1 </italic>
(all weights set to 1.0, see Table
<xref ref-type="table" rid="T1">1</xref>
). We also implemented and tested four other measures, namely Lord-normalized, Al-Mubaid, the classical
<italic>weighted-cosine </italic>
measure, and the SimGIC measure (see Methods) on the same sets of genes. The results obtained with the KEGG pathways using BP annotations are shown in Figure
<xref ref-type="fig" rid="F2">2</xref>
. For each KEGG pathway (x-axis), the intra-set similarity values are represented as histograms (y-axis). Similarity values vary from one pathway to another, reflecting variation in the coherence of gene annotations within pathways. Variations from one pathway to another are relatively uniform for all measures except the Lord measure. For example, intra-set similarity values of the sce00410 pathway are smaller than those of sce00300 for all measures except for the Lord measure. The same is observed between pathways hsa00920 and hsa00140.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Intra-set similarities with the KEGG pathway dataset using BP annotations</bold>
. The intra-set similarity is calculated as the mean of all pairwise gene similarities within a KEGG pathway, with the four measures compared in this study, namely,
<italic>IntelliGO </italic>
(using EC weight
<italic>List1 </italic>
), Lord-normalized, Al-Mubaid, and Weighted-cosine. A set of thirteen pathways were selected from the KEGG Pathway database for yeast (top panel) and human (bottom panel) pathways. Only BP annotations are used here (see also Table 2).</p>
</caption>
<graphic xlink:href="1471-2105-11-588-2"></graphic>
</fig>
<p>A positive feature of the
<italic>IntelliGO </italic>
similarity measure is that unlike other measures, all intra-set values are greater than or equal to 0.5. The relatively low values obtained with the
<italic>weighted-cosine </italic>
measure can be explained by the numerous null pairwise values generated by this method. This is because this measure assumes that the dimensions of the vector space are orthogonal to each other. Hence, whenever two genes lack a common annotation term, their dot product is null, and so is their similarity value. Indeed, null pairwise similarity values are observed in all pathways except for one in human and three in yeast (details not shown).</p>
<p>Very similar results were obtained with the Pfam benchmarking dataset which was analyzed on the basis of MF annotations (Figure
<xref ref-type="fig" rid="F3">3</xref>
). Here again, the
<italic>IntelliGO </italic>
similarity measure always provides intra-set similarity values greater than or equal to 0.5, which is not the case for the other measures. As before, the
<italic>weighted-cosine </italic>
yields the lowest intra-set similarity values for the reason explained above. This drawback led us to omit this measure from the later stages of the work. In summary, our comparison of intra-set similarity values for two benchmarking datasets demonstrates the robustness of the
<italic>IntelliGO </italic>
similarity measure and its ability to capture the internal coherence of gene annotation within predefined sets of genes.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Intra-set similarities with the Pfam clan dataset using MF annotations</bold>
. The intra-set similarity is calculated for all genes of a given species within a Pfam clan using MF annotations. Two collections of ten Pfam clans were selected from the Sanger Pfam database to retrieve yeast (top panel) and human (bottom panel) genes belonging to these clans (see also Table 3).</p>
</caption>
<graphic xlink:href="1471-2105-11-588-3"></graphic>
</fig>
</sec>
<sec>
<title>2.4.3 Influence of EC weight lists</title>
<p>The second part of our evaluation is the study of the effect of varying the weights assigned to ECs in the
<italic>IntelliGO </italic>
similarity measure. As a first experiment, we used four lists of EC weights (see Table
<xref ref-type="table" rid="T1">1</xref>
). In
<italic>List1</italic>
, all EC weights are equal to 1.0, which makes all ECs equivalent in their contribution to the similarity value.
<italic>List1 </italic>
was used above to compare the
<italic>IntelliGO </italic>
measure with the four other similarity measures (Figure
<xref ref-type="fig" rid="F2">2</xref>
and
<xref ref-type="fig" rid="F3">3</xref>
) because these measures do not consider varying EC weights in the calculation. In
<italic>List2</italic>
, the EC weights have been arbitrarily defined to represent the assumption that the
<italic>Exp </italic>
category of ECs is more reliable than the
<italic>Comp </italic>
category, and that the non-supervised
<italic>IEA </italic>
code is less reliable than
<italic>Comp </italic>
codes.
<italic>List3 </italic>
excludes the
<italic>IEA </italic>
code in order to test the similarity measure when using only supervised annotations. Finally,
<italic>List4 </italic>
represents the opposite situation by retaining only the
<italic>IEA </italic>
code to test the contribution of IEA annotations.</p>
<p>These four lists were used to calculate IntelliGO intra-set similarity values on the same datasets as before. For each dataset, the distribution of all pairwise similarity values used to calculate the intra-set averages is shown in Figure
<xref ref-type="fig" rid="F4">4</xref>
with each weight list being shown as a histogram with a class interval of 0.2. On the left of each histogram a
<italic>Missing Values </italic>
bar (MV) shows the number of pairwise similarity values that cannot be calculated with
<italic>List3 </italic>
or
<italic>List4 </italic>
due to the complete absence of annotations for certain genes. As expected, since intra-set similarity values with the IntelliGO measure are greater than or equal to 0.5, the highest numbers of pairwise values are found in the intervals 0.6-0.8 and 0.8-1.0 for all weight lists considered here. For
<italic>List1 </italic>
and
<italic>List2</italic>
, the distribution of values looks similar for all datasets. The effect of excluding the IEA code (
<italic>List3 </italic>
) or considering it alone (
<italic>List4 </italic>
) differs between the KEGG pathways and Pfam clans, i.e. between the BP and MF annotations. It also varies between the yeast and human datasets, reflecting the different ratios of IEA versus non-IEA annotations in these two species (see Figure
<xref ref-type="fig" rid="F1">1</xref>
and Table
<xref ref-type="table" rid="T2">2</xref>
and
<xref ref-type="table" rid="T3">3</xref>
). For the yeast KEGG pathways, the most striking variation is observed with
<italic>List4 </italic>
which gives a marked decrease in the number of values in the 0.8-1.0 class interval, and a significant number of missing values.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Influence of various EC weight lists on the distribution of pairwise similarity values obtained for intra-set similarity calculation</bold>
. KEGG pathway datasets are handled with BP annotations, and Pfam clans with MF annotations. The MV bar is for
<italic>Missing Values </italic>
and represents the number of pairwise similarity values that cannot be calculated using
<italic>List3 </italic>
or
<italic>List4 </italic>
due to the missing annotations for certain genes. Pairwise similarity intervals are displayed on the x axis of the histograms, while values on the y axis represent the number of pairwise similarity values present in each interval.</p>
</caption>
<graphic xlink:href="1471-2105-11-588-4"></graphic>
</fig>
<p>This means that for this dataset, using only IEA BP annotations yields generally lower similarity values and excludes from the calculation those genes without any IEA annotation (11 genes). This reflects the relatively high ratio (1.3) of non-IEA to IEA BP annotations in this dataset, and in yeast in general (2.0). A similar behavior is observed with the human KEGG pathways, not only with
<italic>List4 </italic>
but also with
<italic>List3</italic>
. The higher
<italic>Missing Values </italic>
bars in this dataset result from the high number of genes having either no IEA BP annotations (49 genes) or only IEA BP annotations (68 genes). This type of analysis shows that for such a dataset, IEA annotations are useful to capture intra-set pairwise similarity but they are not sufficient
<italic>per se</italic>
.</p>
<p>For the yeast and human Pfam clans, the distribution of values obtained with
<italic>List3 </italic>
is clearly shifted towards lower class intervals and
<italic>Missing Values </italic>
bars. This reflects the extent of the IEA MF annotations, and their important role in capturing intra-set similarity in these datasets (the ratios between non-IEA and IEA MF annotations are 0.3 and 0.46 for the yeast and human Pfam clan datasets, respectively). A total of 19 genes are annotated only with an IEA code in yeast, and 29 in human. Concerning
<italic>List4</italic>
, using only IEA MF annotations does not lead to large changes in the value distribution when compared to
<italic>List1 </italic>
and
<italic>List2</italic>
. This suggests that these annotations are sufficient to capture intra-set pairwise similarity for such datasets. However, a significant number of missing values is observed in the yeast Pfam clan dataset, with 20 genes lacking any IEA MF annotations.</p>
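<p>The distributions shown in Figure 4 group pairwise similarity values into class intervals of width 0.2 and count separately the pairs for which no value can be calculated. A minimal Python sketch of such a binning is given below. It assumes that a missing value is encoded as None, that intervals are closed on the left, and that the value 1.0 falls into the last interval; these conventions and the function name are illustrative assumptions, not details taken from the study.</p>

```python
def bin_pairwise_values(values, width=0.2):
    """Return (missing, counts) for similarity values in the range 0.0 to 1.0.

    missing: number of entries equal to None (pairs with no computable value).
    counts: number of values per class interval, lowest interval first.
    """
    n_bins = round(1.0 / width)
    counts = [0] * n_bins
    missing = 0
    for v in values:
        if v is None:
            missing += 1
            continue
        if v > 1.0 or 0.0 > v:
            raise ValueError("similarity values are expected in [0, 1]")
        # Assumption: 1.0 is assigned to the last interval (e.g. 0.8-1.0).
        index = min(int(v * n_bins), n_bins - 1)
        counts[index] += 1
    return missing, counts


# Hypothetical pairwise values; None marks a pair that cannot be compared.
missing, counts = bin_pairwise_values([0.1, 0.55, 0.7, 0.95, 1.0, None, None])
print(missing, counts)
```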
<p>In summary, using customized weight lists for evidence codes in the IntelliGO measure is a useful way to highlight the contribution that certain types of annotations make to similarity values, as shown with the Pfam clan datasets and IEA MF annotations. However, this contribution clearly depends on the dataset and on the considered GO aspect. Other weight lists may be worth considering if there is special interest in certain ECs in certain datasets. In this study, we decided to continue our experiments with
<italic>List2 </italic>
since this weight list expresses the commonly shared view about the relative importance of ECs for gene annotation.</p>
</sec>
<sec>
<title>2.4.4 Discriminating power</title>
<p>The third step of our evaluation consisted of testing the Discriminating Power (
<italic>DP </italic>
) of the
<italic>IntelliGO </italic>
similarity measure and comparing it with three other measures (Lord-normalized, Al-Mubaid, and SimGIC). The calculation of a discriminating power is introduced here to evaluate the ability of a similarity measure to distinguish between two functionally different sets of genes. The DP values obtained with these four measures for the two benchmarking datasets are plotted in Figure
<xref ref-type="fig" rid="F5">5</xref>
and
<xref ref-type="fig" rid="F6">6</xref>
. For the KEGG pathways and BP annotations (Figure
<xref ref-type="fig" rid="F5">5</xref>
), the
<italic>IntelliGO </italic>
similarity measure produces
<italic>DP </italic>
values greater than or equal to 1.3 for each tested pathway, with a maximum of 2.43 for the
<italic>hsa</italic>
04130 pathway. In contrast, the
<italic>DP </italic>
values obtained with the normalized Lord measure oscillate around 1.0 (especially for the yeast pathways), which is not desirable. The Al-Mubaid and SimGIC measures generate rather heterogeneous DP values ranging between 1.0 and 2.5, and 0.2 and 2.3, respectively. Such heterogeneity indicates that the discriminating power of these measures is not as robust as that of the
<italic>IntelliGO </italic>
measure.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Comparison of the inter-set discriminating power of four similarity measures using KEGG pathways and BP annotations</bold>
. The DP values obtained with the
<italic>IntelliGO</italic>
, Lord-normalized, Al-Mubaid, and SimGIC similarity measures are plotted for each KEGG pathway (top panel for yeast and bottom panel for human).</p>
</caption>
<graphic xlink:href="1471-2105-11-588-5"></graphic>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Comparison of the inter-set discriminating power of four similarity measures using Pfam clans and MF annotations</bold>
. The DP values obtained with the
<italic>IntelliGO</italic>
, Lord-normalized, Al-Mubaid, and SimGIC similarity measures are plotted for each Pfam clan (yeast genes on top and human genes at bottom).</p>
</caption>
<graphic xlink:href="1471-2105-11-588-6"></graphic>
</fig>
<p>The results are very similar and even more favorable for the
<italic>IntelliGO </italic>
similarity measure when using Pfam clans and MF annotations as the benchmarking dataset (Figure
<xref ref-type="fig" rid="F6">6</xref>
). In this case, all of the
<italic>IntelliGO DP </italic>
values are greater than 1.5, and give a maximum of 5.4 for Pfam clan
<italic>CL</italic>
0255.6. The three other measures give either poorly discriminating values (e.g. Lord-normalized for yeast Pfam clans) or quite heterogeneous profiles (all other cases). Overall, these results indicate that the
<italic>IntelliGO </italic>
similarity measure has a remarkable ability to discriminate between distinct sets of genes. This provides strong evidence that this measure will be useful in gene clustering experiments.</p>
</sec>
<sec>
<title>2.4.5 Evaluation with the CESSM tool</title>
<p>A complementary evaluation was performed using the recent Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) tool. This on-line tool [
<xref ref-type="bibr" rid="B51">51</xref>
] enables the comparison of a given measure with previously published measures on the basis of their correlation with sequence, Pfam, and Enzyme Classification similarities [
<xref ref-type="bibr" rid="B52">52</xref>
]. It uses a dataset of 13,430 protein pairs involving 1,039 proteins from various species. These protein pairs are characterized by their sequence similarity value, their number of common Pfam domains and their degree of relatedness in the Enzyme Classification, leading to the so-called
<italic>SeqSim</italic>
,
<italic>Pfam </italic>
and
<italic>ECC </italic>
metrics. Semantic similarity values, calculated with various existing methods, are then analyzed against these three biological similarity indicators. The user is invited to upload the values calculated for the dataset with their own semantic similarity measure. The CESSM tool processes these values and returns the corresponding graphs and a table displaying the Pearson correlation coefficients calculated using the user's measure as well as 11 other reference measures, along with calculated resolution values for each measure.</p>
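<p>For readers who wish to reproduce this kind of comparison locally, the Pearson correlation coefficient between a list of semantic similarity values and a list of reference values (for example SeqSim scores for the same protein pairs) can be sketched in Python as follows. This is a generic textbook formulation, not the CESSM implementation, and the numerical values are hypothetical.</p>

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("two non-empty lists of equal length are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for four protein pairs.
semantic = [0.2, 0.4, 0.6, 0.9]
reference = [0.1, 0.5, 0.5, 0.8]
print(pearson(semantic, reference))
```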
<p>We present in Table
<xref ref-type="table" rid="T4">4</xref>
only the results obtained for correlation coefficients as they are the most useful for comparison purposes. The values obtained with the
<italic>IntelliGO </italic>
measure using the MF annotation and including or excluding GO terms with IEA evidence codes are shown in the last column. When the whole GO annotation is considered (first three rows), the correlation coefficients range from 0.40 for the SeqSim metrics to 0.65 for ECC metrics. The value obtained with the ECC metrics is higher than all other values reported for this comparison, the next best being that of the LB measure (0.64). For the Pfam and SeqSim metrics, the correlation coefficients obtained with the
<italic>IntelliGO </italic>
measure are lower than those obtained with five and seven other measures, respectively, the best values being obtained with the SimGIC measure (0.63 and 0.71, respectively). When IEA annotations are excluded (the final three rows), the IntelliGO correlation coefficients are lower for the ECC and Pfam metrics, as observed with most other measures, but slightly higher for the SeqSim metrics. This limited increase or absence of decrease is observed with two other measures (SimUI, JA), whereas much larger increases are seen for the three Max variants of the Resnik, Lord, and Jaccard methods (RM, LM, JM). Hence, it appears that in this evaluation, the
<italic>IntelliGO </italic>
measure gives correlation values that are intermediate between those obtained with poor (RA, RM, LA, LM, JA, JM) and high (SimGIC, SimUI, RB, LB, JB) performance methods.</p>
<table-wrap id="T4" position="float">
<label>Table 4</label>
<caption>
<p>Evaluation results obtained with the CESSM evaluation tool.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" colspan="2">Metrics</th>
<th align="center" colspan="12">Method</th>
</tr>
<tr>
<th></th>
<th></th>
<th colspan="12">
<hr></hr>
</th>
</tr>
<tr>
<th></th>
<th></th>
<th align="center">SimGIC</th>
<th align="center">SimUI</th>
<th align="center">RA</th>
<th align="center">RM</th>
<th align="center">RB</th>
<th align="center">LA</th>
<th align="center">LM</th>
<th align="center">LB</th>
<th align="center">JA</th>
<th align="center">JM</th>
<th align="center">JB</th>
<th align="center">IntelliGO</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td align="center">ECC</td>
<td align="center">0.62</td>
<td align="center">0.63</td>
<td align="center">0.39</td>
<td align="center">0.45</td>
<td align="center">0.60</td>
<td align="center">0.42</td>
<td align="center">0.45</td>
<td align="center">0.64</td>
<td align="center">0.34</td>
<td align="center">0.36</td>
<td align="center">0.56</td>
<td align="center">0.65</td>
</tr>
<tr>
<td></td>
<td colspan="13">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">Pfam</td>
<td align="center">0.63</td>
<td align="center">0.61</td>
<td align="center">0.44</td>
<td align="center">0.18</td>
<td align="center">0.57</td>
<td align="center">0.44</td>
<td align="center">0.18</td>
<td align="center">0.56</td>
<td align="center">0.33</td>
<td align="center">0.12</td>
<td align="center">0.49</td>
<td align="center">0.48</td>
</tr>
<tr>
<td></td>
<td colspan="13">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">All EC</td>
<td align="center">SeqSim</td>
<td align="center">0.71</td>
<td align="center">0.59</td>
<td align="center">0.50</td>
<td align="center">0.12</td>
<td align="center">0.66</td>
<td align="center">0.46</td>
<td align="center">0.12</td>
<td align="center">0.60</td>
<td align="center">0.29</td>
<td align="center">0.10</td>
<td align="center">0.54</td>
<td align="center">0.40</td>
</tr>
<tr>
<td colspan="14">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">ECC</td>
<td align="center">0.58</td>
<td align="center">0.57</td>
<td align="center">0.37</td>
<td align="center">0.47</td>
<td align="center">0.48</td>
<td align="center">0.38</td>
<td align="center">0.51</td>
<td align="center">0.51</td>
<td align="center">0.37</td>
<td align="center">0.46</td>
<td align="center">0.51</td>
<td align="center">0.48</td>
</tr>
<tr>
<td></td>
<td colspan="13">
<hr></hr>
</td>
</tr>
<tr>
<td></td>
<td align="center">Pfam</td>
<td align="center">0.58</td>
<td align="center">0.55</td>
<td align="center">0.43</td>
<td align="center">0.44</td>
<td align="center">0.52</td>
<td align="center">0.42</td>
<td align="center">0.42</td>
<td align="center">0.51</td>
<td align="center">0.33</td>
<td align="center">0.34</td>
<td align="center">0.45</td>
<td align="center">0.43</td>
</tr>
<tr>
<td></td>
<td colspan="13">
<hr></hr>
</td>
</tr>
<tr>
<td align="center">Non-IEA EC</td>
<td align="center">SeqSim</td>
<td align="center">0.66</td>
<td align="center">0.59</td>
<td align="center">0.46</td>
<td align="center">0.48</td>
<td align="center">0.65</td>
<td align="center">0.41</td>
<td align="center">0.40</td>
<td align="center">0.59</td>
<td align="center">0.31</td>
<td align="center">0.36</td>
<td align="center">0.52</td>
<td align="center">0.43</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Table 4: Pearson linear correlation coefficients are displayed for the ECC (Enzyme Classification Comparison), Pfam, and sequence similarity metrics (SeqSim). The
<italic>Molecular Function </italic>
GO annotation is used including (first three rows) or excluding (last three rows) annotation terms with IEA evidence codes. The column headings are listed as: SimGIC: Graph Information Content similarity; SimUI: Union Intersection similarity; RA: Resnik Average; RM: Resnik Max; RB: Resnik Best match; LA: Lord Average; LM: Lord Max; LB: Lord Best match; JA: Jaccard Average; JM: Jaccard Max; JB: Jaccard Best match.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
<sec>
<title>3 Discussion</title>
<p>Considering the growing number of semantic similarity measures, an important aspect of this study is the proposal of a method for estimating and comparing their performance. So far, rather heterogeneous and non-reproducible strategies have been used to validate new semantic similarity measures [
<xref ref-type="bibr" rid="B9">9</xref>
]. For example, it is generally assumed that gene products displaying sequence similarity should display similar MF annotations. This hypothesis was used by Lord
<italic>et al</italic>
. to evaluate their semantic similarity measure by exploring the correlation between gene annotation and sequence similarity in a set of human proteins [
<xref ref-type="bibr" rid="B2">2</xref>
]. They found a correlation between annotation and sequence similarity when using the MF aspect of GO annotations, and this was later confirmed by Schlicker
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B28">28</xref>
] using a different similarity measure. These authors also tested their similarity measure for clustering protein families from the Pfam database on the basis of their MF annotations. They showed that Pfam families with the same function did form rather well-defined clusters. In this frame of mind, the CESSM tool used in this study (Section 2.4.5) is a valuable initiative towards standardizing the evaluation of semantic similarity measures.</p>
<p>Another group of evaluation techniques relies on the hypothesis that genes displaying similar expression profiles should share similar functions or participate in similar biological processes. This was used by Chabalier
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B41">41</xref>
] to validate their similarity measure. These authors were able to reconstitute networks of genes presenting high pairwise similarity based on BP annotations and to characterize at least some of these networks with a particular transcriptional behavior and/or some matching with relevant KEGG pathways.</p>
<p>Using pathways as established sets of genes displaying functional similarity has also become an accepted way to validate new similarity measures. The analyses performed by Guo
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B35">35</xref>
] showed that all pairs of proteins within KEGG human regulatory pathways have significantly higher similarity than expected by chance in terms of BP annotations. Wang
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B29">29</xref>
] and Al-Mubaid
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B26">26</xref>
] have tested their similarity measures on yeast genes belonging to pathways extracted from the
<italic>Saccharomyces </italic>
Genome Database. In the former study, only MF annotations were considered and the authors' similarity measure led to clustering genes with similar functions within a pathway much more efficiently than a measure based on Resnik's similarity between GO terms. In the latter study, the values obtained for pairwise gene similarity using BP annotations within each studied pathway were also more consistent than those obtained with a measure based on Resnik's similarity.</p>
<p>In our study, two benchmarking datasets of KEGG pathways and Pfam clans were used to test the performance of the
<italic>IntelliGO </italic>
similarity measure. Expressions for intra-set similarity and inter-set discriminating power were defined to carry out this evaluation. The testing hypotheses used here are that genes in the same pathway or Pfam clan should share similar BP or MF annotations, respectively. These datasets contain 465 and 218 genes, respectively, which is fewer than in the CESSM evaluation dataset (1,039 proteins). However, the calculation of intra-set and inter-set similarities led to 67,933 pairwise comparisons, which is more than in the CESSM dataset (13,430 protein pairs).</p>
</sec>
<sec>
<title>4 Conclusions and perspectives</title>
<p>This paper presents
<italic>IntelliGO</italic>
, a new vector-based semantic similarity measure for the functional comparison of genes. The
<italic>IntelliGO annotation </italic>
vector space model differs from others in both the weighting scheme used and the relationships defined between base vectors. The definition of this novel vector space model allows heterogeneous properties expressing the semantics of GO annotations (namely annotation frequency of GO terms, origin of GO annotations through evidence codes, and term-term relationships in the GO graph) to be integrated in a common framework. Moreover, the
<italic>IntelliGO </italic>
measure avoids some drawbacks encountered with other similarity measures, such as the problem of how best to aggregate term-term similarities. It also rigorously solves the problem of multiple depth values for GO terms in the GO rDAG structure. Furthermore, the effect of annotation heterogeneity across species is reduced when comparing genes within a given species thanks to the use of IAF coefficients which are constrained to the given species. Our results show that the
<italic>IntelliGO </italic>
similarity measure is robust since it copes with the variability of gene annotations in each considered set of genes, thereby providing consistent results such as an intra-set similarity value of at least 0.50 and a discriminating power of at least 1.3. Moreover, it has been shown that the
<italic>IntelliGO </italic>
similarity measure can use ECs to estimate the relative contributions of GO annotations to gene functional similarity. In future work, we intend to use our similarity measure in clustering experiments using hierarchical and
<italic>K-means </italic>
clustering of our benchmarking datasets. We will also test co-clustering approaches to compare functional clustering using
<italic>IntelliGO </italic>
with differential expression profiles [
<xref ref-type="bibr" rid="B24">24</xref>
,
<xref ref-type="bibr" rid="B53">53</xref>
].</p>
</sec>
<sec>
<title>5 Methods</title>
<p>The C++ programming language was used for developing all programs. The extraction of the
<italic>LCAs </italic>
and the
<italic>SPLs </italic>
of pairs of GO terms was performed by querying the GO relational database with the
<italic>AmiGO </italic>
tool [
<xref ref-type="bibr" rid="B54">54</xref>
].</p>
<sec>
<title>5.1 Reference Similarity Measures</title>
<p>The four measures compared in our evaluation were implemented using the following definitions. Let
<italic>g</italic>
<sub>1 </sub>
and
<italic>g</italic>
<sub>2 </sub>
be two gene products represented by collections of GO terms
<italic>g</italic>
<sub>1</sub>
={
<italic>t</italic>
<sub>1,1</sub>
, ...,
<italic>t</italic>
<sub>1,
<italic>i</italic>
</sub>
, ...,
<italic>t</italic>
<sub>1,
<italic>n</italic>
</sub>
} and
<italic>g</italic>
<sub>2 </sub>
= {
<italic>t</italic>
<sub>2,1</sub>
, ...,
<italic>t</italic>
<sub>2,
<italic>j</italic>
</sub>
, ...,
<italic>t</italic>
<sub>2,
<italic>m</italic>
</sub>
}. The first measure is Lord's similarity measure [
<xref ref-type="bibr" rid="B2">2</xref>
], which is based on Resnik's pairwise term similarity. For each pair of terms,
<italic>t
<sub>i </sub>
</italic>
and
<italic>t
<sub>j</sub>
</italic>
, the Resnik measure is defined as the information content (IC) of their
<italic>LCA</italic>
:</p>
<p>
<disp-formula id="bmcM21">
<label>(21)</label>
<mml:math id="M40" name="1471-2105-11-588-i39" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>A</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Then, the Lord similarity measure between
<italic>g</italic>
<sub>1 </sub>
and
<italic>g</italic>
<sub>2 </sub>
is calculated as the average of the Resnik similarity values obtained for all pairs of annotation terms:</p>
<p>
<disp-formula id="bmcM22">
<label>(22)</label>
<mml:math id="M41" name="1471-2105-11-588-i40" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>*</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Because this measure can yield values greater than 1.0, we normalize the values obtained for a set of genes or for a collection of sets by dividing by the maximal value.</p>
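<p>A minimal Python sketch of equations (21) and (22), followed by the normalization by the maximal value, is given below. The toy ontology, its information content values and the helper names are hypothetical; in particular, the toy ontology is a simple tree, so that the lowest common ancestor is unique, which is not guaranteed in the real GO graph.</p>

```python
# Hypothetical toy ontology: child -> parent, with made-up IC values.
PARENT = {"root": None, "A": "root", "B": "A", "C": "A", "D": "root"}
IC = {"root": 0.0, "A": 1.0, "B": 2.0, "C": 2.5, "D": 1.5}


def ancestors(term):
    """Return the term followed by its ancestors, walking up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def lca(t_i, t_j):
    """Lowest common ancestor in the toy tree (first shared term going up)."""
    seen = set(ancestors(t_j))
    for term in ancestors(t_i):
        if term in seen:
            return term
    raise ValueError("terms do not share an ancestor")


def sim_resnik(t_i, t_j):
    """Equation (21): information content of the LCA of two terms."""
    return IC[lca(t_i, t_j)]


def sim_lord(g1, g2):
    """Equation (22): average Resnik similarity over all n * m term pairs."""
    total = sum(sim_resnik(a, b) for a in g1 for b in g2)
    return total / (len(g1) * len(g2))


def normalize_by_max(values):
    """Divide every value by the maximal value of the collection."""
    top = max(values)
    if top == 0:
        return list(values)
    return [v / top for v in values]


print(sim_lord(["B", "C"], ["B", "D"]))
print(normalize_by_max([0.75, 1.5, 3.0]))
```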
<p>The second measure used here was introduced by Al-Mubaid
<italic>et al</italic>
. [
<xref ref-type="bibr" rid="B31">31</xref>
]. This method first calculates the shortest path length (PL) matrix between all pairs of GO terms annotating the two genes, i.e. PL (
<italic>t</italic>
<sub>1,
<italic>i</italic>
</sub>
,
<italic>t</italic>
<sub>2,
<italic>j</italic>
</sub>
), ∀i ∈[1,n], ∀j ∈[1,m]. It then calculates the average of all PL values in the matrix, which represents the average PL between the two gene products:</p>
<p>
<disp-formula id="bmcM23">
<label>(23)</label>
<mml:math id="M42" name="1471-2105-11-588-i41" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>*</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>Finally, a transfer function is applied to this PL value to convert it into a similarity value. As this similarity monotonically decreases when the PL increases, the similarity value is obtained by:</p>
<p>
<disp-formula id="bmcM24">
<label>(24)</label>
<mml:math id="M43" name="1471-2105-11-588-i42" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>l</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>b</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>*</mml:mo>
<mml:mi>P</mml:mi>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>with
<italic>f </italic>
= 0.2 according to the authors.</p>
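<p>Equations (23) and (24) can be illustrated with the following Python sketch. It assumes that the path length between two terms is the number of edges on the shortest path in the ontology graph treated as undirected, and it uses a hypothetical toy graph; only the value f = 0.2 is taken from the text above.</p>

```python
from collections import deque
from math import exp

# Hypothetical toy ontology given as undirected edges.
EDGES = [("root", "A"), ("A", "B"), ("A", "C"), ("root", "D")]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def path_length(adj, source, target):
    """Shortest path length (number of edges), by breadth-first search."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("terms are not connected")


def average_path_length(g1, g2, adj):
    """Equation (23): mean PL over all n * m pairs of annotation terms."""
    total = sum(path_length(adj, a, b) for a in g1 for b in g2)
    return total / (len(g1) * len(g2))


def sim_al_mubaid(g1, g2, adj, f=0.2):
    """Equation (24): exponential transfer function applied to the mean PL."""
    return exp(-f * average_path_length(g1, g2, adj))


ADJ = build_adjacency(EDGES)
print(sim_al_mubaid(["B", "C"], ["B", "D"], ADJ))
```

<p>With this transfer function, two genes annotated by exactly the same single term have a mean path length of 0 and therefore a similarity of 1.0.</p>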
<p>The third measure used here is the classical weighted-cosine measure, whereby each gene is represented by its annotation vector in an orthogonal VSM. Each component represents a GO term and is weighted by its own IAF value (
<italic>w
<sub>i </sub>
</italic>
=
<italic>IAF </italic>
(
<italic>t
<sub>i</sub>
</italic>
)) if the term annotates the gene; otherwise the weight is set to 0.0. The weighted-cosine measure is then defined as in (20), but with the classical dot product expression in which
<inline-formula>
<mml:math id="M44" name="1471-2105-11-588-i43" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>g</mml:mi>
<mml:mo>→</mml:mo>
</mml:mover>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</inline-formula>
, for all terms
<italic>t
<sub>i </sub>
</italic>
present in both annotation vectors.</p>
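<p>A minimal Python sketch of this weighted-cosine measure is given below. Equation (20) is not reproduced in this section, so the sketch assumes the usual cosine normalisation (dot product divided by the product of the two vector norms). The function name, the dictionary of IAF values and the term identifiers are hypothetical.</p>
<preformat>
```python
import math


def weighted_cosine(terms_1, terms_2, iaf):
    """Cosine of two annotation vectors whose non-zero weights are IAF values."""
    set_1, set_2 = set(terms_1), set(terms_2)
    common = set_1.intersection(set_2)
    dot = sum(iaf[t] ** 2 for t in common)  # sum of w_i^2 over shared terms
    norm_1 = math.sqrt(sum(iaf[t] ** 2 for t in set_1))
    norm_2 = math.sqrt(sum(iaf[t] ** 2 for t in set_2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # a gene without weighted terms is similar to nothing
    return dot / (norm_1 * norm_2)


# Hypothetical IAF values for three placeholder terms.
iaf = {"GO:a": 1.0, "GO:b": 2.0, "GO:c": 2.0}
score = weighted_cosine({"GO:a", "GO:b"}, {"GO:b", "GO:c"}, iaf)
```
</preformat>
<p>Because the vector space is orthogonal, only terms shared by both genes contribute to the dot product; two genes annotated by distinct but closely related terms therefore score 0 under this measure.</p>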
<p>The last measure used here is SimGIC (Graph Information Content), which is also known as the Weighted Jaccard measure [
<xref ref-type="bibr" rid="B36">36</xref>
]. This measure is available in the
<italic>csbl.go </italic>
package within R Bioconductor [
<xref ref-type="bibr" rid="B55">55</xref>
], [
<xref ref-type="bibr" rid="B56">56</xref>
]. Given two gene products
<italic>g</italic>
<sub>1 </sub>
and
<italic>g</italic>
<sub>2 </sub>
represented by their two extended annotation sets (terms plus ancestors), the semantic similarity between these two gene products is calculated as the ratio between the sum of the information contents of GO terms in the intersection and the sum of the information contents of GO terms in the union:</p>
<p>
<disp-formula id="bmcM25">
<label>(25)</label>
<mml:math id="M45" name="1471-2105-11-588-i44" overflow="scroll">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>G</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>∩</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>∪</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
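<p>The ratio in equation (25) can be sketched in a few lines of Python. This is not the csbl.go implementation; the function name, the information-content dictionary and the term names are placeholders chosen for the example, and the annotation sets are assumed to be already extended with ancestor terms.</p>
<preformat>
```python
def sim_gic(extended_1, extended_2, ic):
    """Weighted Jaccard index of two extended annotation sets."""
    shared = extended_1.intersection(extended_2)
    union = extended_1.union(extended_2)
    denominator = sum(ic[t] for t in union)
    if denominator == 0:
        return 0.0  # only zero-information terms (e.g. the root) are present
    return sum(ic[t] for t in shared) / denominator


# Hypothetical information contents; the root term carries no information.
ic = {"root": 0.0, "t1": 1.0, "t2": 3.0, "t3": 4.0}
g1 = {"root", "t1", "t2"}
g2 = {"root", "t1", "t3"}
score = sim_gic(g1, g2, ic)  # (0 + 1) / (0 + 1 + 3 + 4)
```
</preformat>
<p>Shared ancestors with low information content add little to the numerator, so two genes that only share very general terms obtain a low score.</p>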
</sec>
<sec>
<title>5.2 Intra-Set and Inter-Set Similarity</title>
<p>Consider
<italic>S</italic>
, a collection of sets of genes where
<italic>S </italic>
= {
<italic>S</italic>
<sub>1</sub>
,
<italic>S</italic>
<sub>2</sub>
, ...,
<italic>S
<sub>p</sub>
</italic>
} (a set
<italic>S
<sub>k </sub>
</italic>
can be a KEGG pathway or a Pfam clan). For each set
<italic>S
<sub>k</sub>
</italic>
, let {
<italic>g</italic>
<sub>
<italic>k</italic>
1</sub>
,
<italic>g</italic>
<sub>
<italic>k</italic>
2</sub>
, ...,
<italic>g
<sub>kn</sub>
</italic>
} be the set of
<italic>n </italic>
genes comprised in
<italic>S
<sub>k</sub>
</italic>
. Let
<italic>Sim</italic>
(
<italic>g</italic>
,
<italic>h</italic>
) be a similarity measure between genes
<italic>g </italic>
and
<italic>h</italic>
. The intra-set similarity value is defined for a given set of genes
<italic>S
<sub>k </sub>
</italic>
by:</p>
<p>
<disp-formula id="bmcM26">
<label>(26)</label>
<mml:math id="M46" name="1471-2105-11-588-i45" overflow="scroll">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>a</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>n</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
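<p>Equation (26) amounts to averaging a pairwise similarity over all ordered pairs of genes in the set, self-comparisons included. The Python sketch below illustrates this; the function names and the toy similarity function are assumptions made for the example.</p>
<preformat>
```python
def intra_set_sim(genes, sim):
    """Mean of sim(g, h) over all n*n ordered pairs of genes in one set."""
    n = len(genes)
    if n == 0:
        raise ValueError("the gene set must not be empty")
    total = sum(sim(g, h) for g in genes for h in genes)
    return total / n ** 2


def toy_sim(g, h):
    """Hypothetical similarity: 1.0 for identical genes, 0.5 otherwise."""
    return 1.0 if g == h else 0.5


value = intra_set_sim(["g1", "g2"], toy_sim)  # (1 + 0.5 + 0.5 + 1) / 4
```
</preformat>
<p>Since both sums run from 1 to n, the n self-comparisons are part of the average, which raises the value for small sets when the similarity of a gene with itself is 1.</p>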
<p>For two sets of genes
<italic>S
<sub>k </sub>
</italic>
and
<italic>S
<sub>l </sub>
</italic>
composed of
<italic>n </italic>
and
<italic>m </italic>
genes respectively, we define the inter-set similarity value by:</p>
<p>
<disp-formula id="bmcM27">
<label>(27)</label>
<mml:math id="M47" name="1471-2105-11-588-i46" overflow="scroll">
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>l</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>*</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
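<p>The inter-set value of equation (27) is the mean similarity over all n*m cross pairs of genes. A minimal Python sketch follows; the function names, the gene labels and the toy similarity function are hypothetical.</p>
<preformat>
```python
def inter_set_sim(genes_k, genes_l, sim):
    """Mean of sim(g, h) over all pairs with g in the first set and h in the second."""
    n, m = len(genes_k), len(genes_l)
    if n == 0 or m == 0:
        raise ValueError("both gene sets must be non-empty")
    total = sum(sim(g, h) for g in genes_k for h in genes_l)
    return total / (n * m)


def toy_sim(g, h):
    """Hypothetical similarity: genes whose labels share a first letter are similar."""
    return 1.0 if g[0] == h[0] else 0.25


value = inter_set_sim(["a1", "a2"], ["a3", "b1", "b2"], toy_sim)
```
</preformat>
<p>In this toy case two of the six cross pairs score 1.0 and four score 0.25, giving an inter-set value of 0.5.</p>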
<p>Note that when the
<italic>Sim </italic>
function takes values in the interval [0,1], so do the
<italic>Intra_Set_Sim </italic>
and
<italic>Inter_Set_Sim </italic>
functions. Finally, for a given collection
<italic>S </italic>
composed of
<italic>p </italic>
sets of genes, the discriminative power of the semantic similarity measure
<italic>Sim </italic>
with respect to a given set
<italic>S
<sub>k </sub>
</italic>
in
<italic>S </italic>
will be defined as:</p>
<p>
<disp-formula id="bmcM28">
<label>(28)</label>
<mml:math id="M48" name="1471-2105-11-588-i47" overflow="scroll">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>a</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≠</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mi>p</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>_</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
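<p>Equation (28) divides the intra-set similarity of a set by the average of its inter-set similarities with the other p - 1 sets. The Python sketch below is one possible reading of that definition; the helper names, the list-of-lists representation of the collection and the toy similarity function are assumptions made for the example.</p>
<preformat>
```python
import math


def mean_pairwise(genes_a, genes_b, sim):
    """Mean of sim(g, h) over all pairs taken from two non-empty gene lists."""
    total = sum(sim(g, h) for g in genes_a for h in genes_b)
    return total / (len(genes_a) * len(genes_b))


def discriminative_power(collection, k, sim):
    """(p - 1) * intra-set similarity of set k over the sum of its inter-set values."""
    p = len(collection)
    intra = mean_pairwise(collection[k], collection[k], sim)
    inter_total = sum(
        mean_pairwise(collection[k], collection[i], sim)
        for i in range(p)
        if i != k
    )
    if inter_total == 0:
        return math.inf  # set k shares no similarity with any other set
    return (p - 1) * intra / inter_total


def toy_sim(g, h):
    """Hypothetical similarity: genes whose labels share a first letter are similar."""
    return 1.0 if g[0] == h[0] else 0.25


collection = [["a1", "a2"], ["b1", "b2"], ["c1"]]
dp = discriminative_power(collection, 0, toy_sim)  # 2 * 1.0 / (0.25 + 0.25)
```
</preformat>
<p>A value above 1 indicates that genes of the set are, on average, more similar to each other than to genes of the other sets; a value close to 1 indicates that the measure does not separate the set from the rest of the collection.</p>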
</sec>
</sec>
<sec>
<title>Authors' contributions</title>
<p>SB, MD and MS designed the IntelliGO similarity measure. SB developed and implemented the method. MD and MS evaluated the biological results. MD, MS, OP and AN supervised the study, making significant contributions. SB, MD and MS wrote the paper. All authors proofread and approved the final manuscript.</p>
</sec>
</body>
<back>
<sec>
<title>Acknowledgements</title>
<p>This work was supported by the French National Institute of Cancer (INCa) and by the Region Lorraine program for research and technology (CPER MISN). SB is a recipient of an INCa doctoral fellowship. Special thanks to Dave Ritchie for careful reading of the manuscript.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ball</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Blake</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Botstein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Butler</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Cherry</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Dolinski</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Dwight</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Eppig</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Harris</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hill</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Issel-Tarver</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Kasarskis</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Matese</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Richardson</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Ringwald</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Sherlock</surname>
<given-names>G</given-names>
</name>
<article-title>Gene Ontology: tool for the unification of biology</article-title>
<source>Nature Genetics</source>
<year>2000</year>
<volume>25</volume>
<fpage>25</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="doi">10.1038/75556</pub-id>
<pub-id pub-id-type="pmid">10802651</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Lord</surname>
<given-names>PW</given-names>
</name>
<name>
<surname>Stevens</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Brass</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Goble</surname>
<given-names>CA</given-names>
</name>
<article-title>Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>10</issue>
<fpage>1275</fpage>
<lpage>1283</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/10/1275">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/19/10/1275</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg153</pub-id>
<pub-id pub-id-type="pmid">12835272</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Consortium</surname>
<given-names>TGO</given-names>
</name>
<article-title>The Gene Ontology in 2010: extensions and refinements</article-title>
<source>Nucl Acids Res</source>
<year>2010</year>
<volume>38</volume>
<issue>suppl 1</issue>
<fpage>D331</fpage>
<lpage>335</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkp1018</pub-id>
<pub-id pub-id-type="pmid">19920128</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Barrell</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Dimmer</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Huntley</surname>
<given-names>RP</given-names>
</name>
<name>
<surname>Binns</surname>
<given-names>D</given-names>
</name>
<name>
<surname>O'Donovan</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Apweiler</surname>
<given-names>R</given-names>
</name>
<article-title>The GOA database in 2009-an integrated Gene Ontology Annotation resource</article-title>
<source>Nucl Acids Res</source>
<year>2009</year>
<volume>37</volume>
<issue>suppl 1</issue>
<fpage>D396</fpage>
<lpage>403</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn803</pub-id>
<pub-id pub-id-type="pmid">18957448</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="journal">
<name>
<surname>Khatri</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Draghici</surname>
<given-names>S</given-names>
</name>
<article-title>Ontological analysis of gene expression data: current tools, limitations, and open problems</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<issue>18</issue>
<fpage>3587</fpage>
<lpage>3595</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/18/3587">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/18/3587</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti565</pub-id>
<pub-id pub-id-type="pmid">15994189</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="journal">
<name>
<surname>Huang</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Sherman</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Collins</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Alvord</surname>
<given-names>WG</given-names>
</name>
<name>
<surname>Roayaei</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Baseler</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Lane</surname>
<given-names>HC</given-names>
</name>
<name>
<surname>Lempicki</surname>
<given-names>R</given-names>
</name>
<article-title>The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists</article-title>
<source>Genome Biology</source>
<year>2007</year>
<volume>8</volume>
<issue>9</issue>
<fpage>R183</fpage>
<ext-link ext-link-type="uri" xlink:href="http://genomebiology.com/2007/8/9/R183">http://genomebiology.com/2007/8/9/R183</ext-link>
<pub-id pub-id-type="doi">10.1186/gb-2007-8-9-r183</pub-id>
<pub-id pub-id-type="pmid">17784955</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>Beissbarth</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Speed</surname>
<given-names>TP</given-names>
</name>
<article-title>GOstat: find statistically overrepresented Gene Ontologies within a group of genes</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<issue>9</issue>
<fpage>1464</fpage>
<lpage>1465</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/9/1464">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/9/1464</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bth088</pub-id>
<pub-id pub-id-type="pmid">14962934</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="other">
<name>
<surname>Speer</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Spieth</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Zell</surname>
<given-names>A</given-names>
</name>
<article-title>A Memetic Co-Clustering Algorithm for Gene Expression Profiles and Biological Annotation</article-title>
<year>2004</year>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>Pesquita</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Faria</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Falcão</surname>
<given-names>AO</given-names>
</name>
<name>
<surname>Lord</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Couto</surname>
<given-names>FM</given-names>
</name>
<article-title>Semantic Similarity in Biomedical Ontologies</article-title>
<source>PLoS Comput Biol</source>
<year>2009</year>
<volume>5</volume>
<issue>7</issue>
<fpage>e1000443</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1000443</pub-id>
<pub-id pub-id-type="pmid">19649320</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Rogers</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Ben-Hur</surname>
<given-names>A</given-names>
</name>
<article-title>The use of gene ontology evidence codes in preventing classifier assessment bias</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>9</issue>
<fpage>1173</fpage>
<lpage>1177</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/9/1173">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/9/1173</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp122</pub-id>
<pub-id pub-id-type="pmid">19254922</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="other">
<article-title>The Gene Ontology Evidence Tree</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.geneontology.org/GO.evidence.tree.shtml">http://www.geneontology.org/GO.evidence.tree.shtml</ext-link>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="other">
<name>
<surname>Du</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>CF</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>PS</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>JZ</given-names>
</name>
<article-title>G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery</article-title>
<source>Nucl Acids Res</source>
<year>2009</year>
<ext-link ext-link-type="uri" xlink:href="http://nar.oxfordjournals.org/cgi/content/abstract/gkp463v1">http://nar.oxfordjournals.org/cgi/content/abstract/gkp463v1</ext-link>
<comment>gkp463</comment>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Popescu</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Keller</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JA</given-names>
</name>
<article-title>Fuzzy Measures on the Gene Ontology for Gene Product Similarity</article-title>
<source>IEEE/ACM Transactions on Computational Biology and Bioinformatics</source>
<year>2006</year>
<volume>3</volume>
<issue>3</issue>
<fpage>263</fpage>
<lpage>274</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2006.37</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="journal">
<name>
<surname>Ganesan</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Garcia-Molina</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Widom</surname>
<given-names>J</given-names>
</name>
<article-title>Exploiting hierarchical domain structure to compute similarity</article-title>
<source>ACM Trans Inf Syst</source>
<year>2003</year>
<volume>21</volume>
<fpage>64</fpage>
<lpage>93</lpage>
<pub-id pub-id-type="doi">10.1145/635484.635487</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="other">
<name>
<surname>Blanchard</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Harzallah</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kuntz</surname>
<given-names>P</given-names>
</name>
<article-title>A generic framework for comparing semantic similarities on a subsumption hierarchy</article-title>
<source>18th European Conference on Artificial Intelligence (ECAI)</source>
<year>2008</year>
<fpage>20</fpage>
<lpage>24</lpage>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Tversky</surname>
<given-names>A</given-names>
</name>
<article-title>Features of similarity</article-title>
<source>Psychological Review</source>
<year>1977</year>
<volume>84</volume>
<fpage>327</fpage>
<lpage>352</lpage>
<pub-id pub-id-type="doi">10.1037/0033-295X.84.4.327</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Lee</surname>
<given-names>WN</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sundlass</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Musen</surname>
<given-names>M</given-names>
</name>
<article-title>Comparison of Ontology-based Semantic-Similarity Measures</article-title>
<source>AMIA Annu Symp Proceedings</source>
<year>2008</year>
<volume>V2008</volume>
<fpage>384</fpage>
<lpage>388</lpage>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="other">
<name>
<surname>Resnik</surname>
<given-names>P</given-names>
</name>
<article-title>Using Information Content to Evaluate Semantic Similarity in a Taxonomy</article-title>
<source>IJCAI</source>
<year>1995</year>
<fpage>448</fpage>
<lpage>453</lpage>
<ext-link ext-link-type="uri" xlink:href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.5277">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.5277</ext-link>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="other">
<name>
<surname>Jiang</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Conrath</surname>
<given-names>DW</given-names>
</name>
<article-title>Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy</article-title>
<source>International Conference Research on Computational Linguistics (ROCLING X)</source>
<year>1997</year>
<fpage>9008+</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.bibsonomy.org/bibtex/2c4ffc507dafc908eab62fde53f7e4f7a/sdo">http://www.bibsonomy.org/bibtex/2c4ffc507dafc908eab62fde53f7e4f7a/sdo</ext-link>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Miller</surname>
<given-names>GA</given-names>
</name>
<article-title>WordNet: A Lexical Database for English</article-title>
<source>Communications of the ACM</source>
<year>1995</year>
<volume>38</volume>
<fpage>39</fpage>
<lpage>41</lpage>
<pub-id pub-id-type="doi">10.1145/219717.219748</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="book">
<name>
<surname>Wu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Palmer</surname>
<given-names>M</given-names>
</name>
<article-title>Verbs semantics and lexical selection</article-title>
<source>Proceedings of the 32nd annual meeting on Association for Computational Linguistics</source>
<year>1994</year>
<publisher-name>Morristown, NJ, USA: Association for Computational Linguistics</publisher-name>
<fpage>133</fpage>
<lpage>138</lpage>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="book">
<name>
<surname>Lin</surname>
<given-names>D</given-names>
</name>
<article-title>An Information-Theoretic Definition of Similarity</article-title>
<source>ICML '98. Proceedings of the Fifteenth International Conference on Machine Learning</source>
<year>1998</year>
<publisher-name>San Francisco, CA, USA: Morgan Kaufmann Publishers Inc</publisher-name>
<fpage>296</fpage>
<lpage>304</lpage>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Sevilla</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Segura</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Podhorski</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Guruceaga</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Mato</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Martinez-Cruz</surname>
<given-names>LA</given-names>
</name>
<name>
<surname>Corrales</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Rubio</surname>
<given-names>A</given-names>
</name>
<article-title>Correlation between Gene Expression and GO Semantic Similarity</article-title>
<source>IEEE/ACM Trans. Comput. Biol. Bioinformatics</source>
<year>2005</year>
<volume>2</volume>
<issue>4</issue>
<fpage>330</fpage>
<lpage>338</lpage>
<pub-id pub-id-type="doi">10.1109/TCBB.2005.50</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="other">
<name>
<surname>Brameier</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Wiuf</surname>
<given-names>C</given-names>
</name>
<article-title>Co-Clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self organizing maps</article-title>
<source>Journal of Biomedical Informatics</source>
<year>2007</year>
<volume>40</volume>
<fpage>160</fpage>
<lpage>173</lpage>
<pub-id pub-id-type="doi">10.1016/j.jbi.2006.05.001</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="journal">
<name>
<surname>Rada</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Mili</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Bicknell</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Blettner</surname>
<given-names>M</given-names>
</name>
<article-title>Development and application of a metric on semantic nets</article-title>
<source>Systems, Man and Cybernetics, IEEE Transactions on</source>
<year>1989</year>
<volume>19</volume>
<fpage>17</fpage>
<lpage>30</lpage>
<pub-id pub-id-type="doi">10.1109/21.24528</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="book">
<name>
<surname>Nagar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Al-Mubaid</surname>
<given-names>H</given-names>
</name>
<article-title>A New Path Length Measure Based on GO for Gene Similarity with Evaluation using SGD Pathways</article-title>
<source>Proceedings of the 2008 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS 08)</source>
<year>2008</year>
<publisher-name>Washington, DC, USA: IEEE Computer Society</publisher-name>
<fpage>590</fpage>
<lpage>595</lpage>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="other">
<name>
<surname>Floridi</surname>
<given-names>L</given-names>
</name>
<article-title>Outline of a Theory of Strongly Semantic Information</article-title>
<source>Minds Mach</source>
<year>2004</year>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="journal">
<name>
<surname>Schlicker</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Domingues</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Rahnenfuhrer</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
<article-title>A new measure for functional similarity of gene products based on Gene Ontology</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>302</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/7/302">http://www.biomedcentral.com/1471-2105/7/302</ext-link>
<pub-id pub-id-type="doi">10.1186/1471-2105-7-302</pub-id>
<pub-id pub-id-type="pmid">16776819</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="journal">
<name>
<surname>Wang</surname>
<given-names>JZ</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Payattakool</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>PS</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>CF</given-names>
</name>
<article-title>A new method to measure the semantic similarity of GO terms</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<issue>10</issue>
<fpage>1274</fpage>
<lpage>1281</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/10/1274">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/10/1274</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btm087</pub-id>
<pub-id pub-id-type="pmid">17344234</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Othman</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Deris</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Illias</surname>
<given-names>RM</given-names>
</name>
<article-title>A genetic similarity algorithm for searching the Gene Ontology terms and annotating anonymous protein sequences</article-title>
<source>J of Biomedical Informatics</source>
<year>2008</year>
<volume>41</volume>
<fpage>65</fpage>
<lpage>81</lpage>
<pub-id pub-id-type="doi">10.1016/j.jbi.2007.05.010</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="other">
<name>
<surname>Nagar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Al-Mubaid</surname>
<given-names>H</given-names>
</name>
<article-title>Using path length measure for gene clustering based on similarity of annotation terms</article-title>
<source>Computers and Communications, 2008. ISCC 2008. IEEE Symposium on</source>
<year>2008</year>
<fpage>637</fpage>
<lpage>642</lpage>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="journal">
<name>
<surname>Martin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Brun</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Remy</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Mouren</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Thieffry</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Jacq</surname>
<given-names>B</given-names>
</name>
<article-title>GOToolBox: functional analysis of gene datasets based on Gene Ontology</article-title>
<source>Genome Biology</source>
<year>2004</year>
<volume>5</volume>
<issue>12</issue>
<ext-link ext-link-type="uri" xlink:href="http://genomebiology.com/2004/5/12/R101">http://genomebiology.com/2004/5/12/R101</ext-link>
<pub-id pub-id-type="doi">10.1186/gb-2004-5-12-r101</pub-id>
<pub-id pub-id-type="pmid">15575967</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="journal">
<name>
<surname>Mistry</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pavlidis</surname>
<given-names>P</given-names>
</name>
<article-title>Gene Ontology term overlap as a measure of gene functional similarity</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<fpage>327</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/9/327">http://www.biomedcentral.com/1471-2105/9/327</ext-link>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-327</pub-id>
<pub-id pub-id-type="pmid">18680592</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="other">
<article-title>The Bioconductor GOstats package</article-title>
<ext-link ext-link-type="uri" xlink:href="http://bioconductor.org/packages/2.5/bioc/vignettes/GOstats/inst/doc/GOvis.pdf">http://bioconductor.org/packages/2.5/bioc/vignettes/GOstats/inst/doc/GOvis.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Guo</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Shriver</surname>
<given-names>CD</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Liebman</surname>
<given-names>MN</given-names>
</name>
<article-title>Assessing semantic similarity measures for the characterization of human regulatory pathways</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<issue>8</issue>
<fpage>967</fpage>
<lpage>973</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/8/967">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/8/967</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl042</pub-id>
<pub-id pub-id-type="pmid">16492685</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Pesquita</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Faria</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bastos</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Ferreira</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Falcão</surname>
<given-names>AO</given-names>
</name>
<name>
<surname>Couto</surname>
<given-names>F</given-names>
</name>
<article-title>Metrics for GO based protein semantic similarity: a systematic evaluation</article-title>
<source>BMC Bioinformatics</source>
<year>2008</year>
<volume>9</volume>
<issue>Suppl 5</issue>
<fpage>S4</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/9/S5/S4">http://www.biomedcentral.com/1471-2105/9/S5/S4</ext-link>
<pub-id pub-id-type="doi">10.1186/1471-2105-9-S5-S4</pub-id>
<pub-id pub-id-type="pmid">18460186</pub-id>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="book">
<name>
<surname>Salton</surname>
<given-names>G</given-names>
</name>
<name>
<surname>McGill</surname>
<given-names>MJ</given-names>
</name>
<source>Introduction to Modern Information Retrieval</source>
<year>1983</year>
<publisher-name>McGraw-Hill</publisher-name>
</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="other">
<name>
<surname>Polettini</surname>
<given-names>N</given-names>
</name>
<article-title>The Vector Space Model in Information Retrieval-Term Weighting Problem</article-title>
<year>2004</year>
</mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="other">
<name>
<surname>Bodenreider</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Aubry</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Burgun</surname>
<given-names>A</given-names>
</name>
<article-title>Non-lexical approaches to identifying associative relations in the gene ontology</article-title>
<source>The Gene Ontology. PSB 2005</source>
<year>2005</year>
<fpage>91</fpage>
<lpage>102</lpage>
</mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="other">
<name>
<surname>Glenisson</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Antal</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Mathys</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Moor</surname>
<given-names>BD</given-names>
</name>
<article-title>Evaluation Of The Vector Space Representation In Text-Based Gene Clustering</article-title>
<source>Proc of the Eighth Ann Pac Symp Biocomp (PSB 2003)</source>
<year>2003</year>
<fpage>391</fpage>
<lpage>402</lpage>
</mixed-citation>
</ref>
<ref id="B41">
<mixed-citation publication-type="journal">
<name>
<surname>Chabalier</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Mosser</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Burgun</surname>
<given-names>A</given-names>
</name>
<article-title>A transversal approach to predict gene product networks from ontology-based similarity</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>235</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/1471-2105/8/235">http://www.biomedcentral.com/1471-2105/8/235</ext-link>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-235</pub-id>
<pub-id pub-id-type="pmid">17605807</pub-id>
</mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="journal">
<name>
<surname>Wright</surname>
<given-names>CC</given-names>
</name>
<article-title>The kappa statistic in reliability studies: use, interpretation, and sample size requirements</article-title>
<source>Physical Therapy</source>
<year>2005</year>
<volume>85</volume>
<issue>3</issue>
<fpage>257</fpage>
<lpage>268</lpage>
<pub-id pub-id-type="pmid">15733050</pub-id>
</mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="other">
<name>
<surname>Blott</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Camous</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Gurrin</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>GJF</given-names>
</name>
<name>
<surname>Smeaton</surname>
<given-names>AF</given-names>
</name>
<article-title>On the use of Clustering and the MeSH Controlled Vocabulary to Improve MEDLINE Abstract Search</article-title>
<source>CORIA</source>
<year>2005</year>
<fpage>41</fpage>
<lpage>56</lpage>
</mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="journal">
<name>
<surname>Couto</surname>
<given-names>FM</given-names>
</name>
<name>
<surname>Silva</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Coutinho</surname>
<given-names>PM</given-names>
</name>
<article-title>Measuring semantic similarity between Gene Ontology terms</article-title>
<source>Data Knowl Eng</source>
<year>2007</year>
<volume>61</volume>
<fpage>137</fpage>
<lpage>152</lpage>
<pub-id pub-id-type="doi">10.1016/j.datak.2006.05.003</pub-id>
</mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="other">
<article-title>The NCBI gene2go file</article-title>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz">ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz</ext-link>
</mixed-citation>
</ref>
<ref id="B46">
<mixed-citation publication-type="other">
<article-title>The AmiGO database</article-title>
<ext-link ext-link-type="uri" xlink:href="http://amigo.geneontology.org">http://amigo.geneontology.org</ext-link>
</mixed-citation>
</ref>
<ref id="B47">
<mixed-citation publication-type="other">
<article-title>The KEGG Pathways database</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.genome.jp/kegg/pathway.html">http://www.genome.jp/kegg/pathway.html</ext-link>
</mixed-citation>
</ref>
<ref id="B48">
<mixed-citation publication-type="other">
<article-title>The DBGET database retrieval system</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.genome.jp/dbget/">http://www.genome.jp/dbget/</ext-link>
</mixed-citation>
</ref>
<ref id="B49">
<mixed-citation publication-type="other">
<article-title>The Sanger Pfam database</article-title>
<ext-link ext-link-type="uri" xlink:href="http://pfam.sanger.ac.uk">http://pfam.sanger.ac.uk</ext-link>
</mixed-citation>
</ref>
<ref id="B50">
<mixed-citation publication-type="other">
<article-title>The UniProt database</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.uniprot.org/">http://www.uniprot.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B51">
<mixed-citation publication-type="other">
<article-title>The Collaborative Evaluation of Semantic Similarity Measures tool</article-title>
<ext-link ext-link-type="uri" xlink:href="http://xldb.di.fc.ul.pt/tools/cessm/">http://xldb.di.fc.ul.pt/tools/cessm/</ext-link>
</mixed-citation>
</ref>
<ref id="B52">
<mixed-citation publication-type="other">
<name>
<surname>Pesquita</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pessoa</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Faria</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Couto</surname>
<given-names>F</given-names>
</name>
<article-title>CESSM: Collaborative Evaluation of Semantic Similarity Measures</article-title>
<source>JB2009. Challenges in Bioinformatics</source>
<year>2009</year>
</mixed-citation>
</ref>
<ref id="B53">
<mixed-citation publication-type="book">
<name>
<surname>Benabderrahmane</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Devignes</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Smaïl Tabbone</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Poch</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Napoli</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Nguyen N-H</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Raffelsberger</surname>
<given-names>W</given-names>
</name>
<article-title>Analyse de données transcriptomiques: Modélisation floue de profils d'expression différentielle et analyse fonctionnelle</article-title>
<source>Actes du XXVIIième congrès Informatique des Organisations et Systèmes d'information et de décision - INFORSID 2009</source>
<year>2009</year>
<publisher-name>Toulouse France: IRIT-Toulouse</publisher-name>
<fpage>413</fpage>
<lpage>428</lpage>
<ext-link ext-link-type="uri" xlink:href="http://hal.inria.fr/inria-00394530/en/">http://hal.inria.fr/inria-00394530/en/</ext-link>
</mixed-citation>
</ref>
<ref id="B54">
<mixed-citation publication-type="journal">
<name>
<surname>Carbon</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ireland</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Mungall</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Shu</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Marshall</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>S</given-names>
</name>
<collab>the AmiGO Hub, the Web Presence Working Group</collab>
<article-title>AmiGO: online access to ontology and annotation data</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>2</issue>
<fpage>288</fpage>
<lpage>289</lpage>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/2/288">http://bioinformatics.oxfordjournals.org/cgi/content/abstract/25/2/288</ext-link>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn615</pub-id>
<pub-id pub-id-type="pmid">19033274</pub-id>
</mixed-citation>
</ref>
<ref id="B55">
<mixed-citation publication-type="other">
<article-title>The csbl.go package</article-title>
<ext-link ext-link-type="uri" xlink:href="http://csbi.ltdk.helsinki.fi/anduril/">http://csbi.ltdk.helsinki.fi/anduril/</ext-link>
</mixed-citation>
</ref>
<ref id="B56">
<mixed-citation publication-type="journal">
<name>
<surname>Ovaska</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Laakso</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hautaniemi</surname>
<given-names>S</given-names>
</name>
<article-title>Fast Gene Ontology based clustering for microarray experiments</article-title>
<source>BioData Mining</source>
<year>2008</year>
<volume>1</volume>
<fpage>11</fpage>
<ext-link ext-link-type="uri" xlink:href="http://www.biodatamining.org/content/1/1/11">http://www.biodatamining.org/content/1/1/11</ext-link>
<pub-id pub-id-type="doi">10.1186/1756-0381-1-11</pub-id>
<pub-id pub-id-type="pmid">19025591</pub-id>
</mixed-citation>
</ref>
<ref id="B57">
<mixed-citation publication-type="other">
<article-title>The Pfam_C October 2009 release file</article-title>
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam24.0/Pfam-C.gz">ftp://ftp.sanger.ac.uk/pub/databases/Pfam/releases/Pfam24.0/Pfam-C.gz</ext-link>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To handle this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000075 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000075 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:3098105
   |texte=   IntelliGO: a new vector-based semantic similarity measure including annotation origin
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:21122125" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a InforLorV4 


This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022