
<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Extending ontologies by finding siblings using set expansion techniques</title>
<author>
<name sortKey="Fabian, Gotz" sort="Fabian, Gotz" uniqKey="Fabian G" first="Götz" last="Fabian">Götz Fabian</name>
</author>
<author>
<name sortKey="W Chter, Thomas" sort="W Chter, Thomas" uniqKey="W Chter T" first="Thomas" last="Wächter">Thomas Wächter</name>
</author>
<author>
<name sortKey="Schroeder, Michael" sort="Schroeder, Michael" uniqKey="Schroeder M" first="Michael" last="Schroeder">Michael Schroeder</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">22689774</idno>
<idno type="pmc">3371847</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3371847</idno>
<idno type="RBID">PMC:3371847</idno>
<idno type="doi">10.1093/bioinformatics/bts215</idno>
<date when="2012">2012</date>
<idno type="wicri:Area/Pmc/Corpus">000180</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000180</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Extending ontologies by finding siblings using set expansion techniques</title>
<author>
<name sortKey="Fabian, Gotz" sort="Fabian, Gotz" uniqKey="Fabian G" first="Götz" last="Fabian">Götz Fabian</name>
</author>
<author>
<name sortKey="W Chter, Thomas" sort="W Chter, Thomas" uniqKey="W Chter T" first="Thomas" last="Wächter">Thomas Wächter</name>
</author>
<author>
<name sortKey="Schroeder, Michael" sort="Schroeder, Michael" uniqKey="Schroeder M" first="Michael" last="Schroeder">Michael Schroeder</name>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation:</bold>
Ontologies are an everyday tool in biomedicine to capture and represent knowledge. However, many ontologies lack a high degree of coverage in their domain and need to improve their overall quality and maturity. Automatically extending sets of existing terms will enable ontology engineers to systematically improve text-based ontologies level by level.</p>
<p>
<bold>Results:</bold>
We developed an approach to extend ontologies by discovering new terms which are in a sibling relationship to existing terms of an ontology. For this purpose, we combined two approaches which retrieve new terms from the web. The first approach extracts siblings by exploiting the structure of HTML documents, whereas the second approach uses text mining techniques to extract siblings from unstructured text. Our evaluation against MeSH (Medical Subject Headings) shows that our method for sibling discovery is able to suggest first-class ontology terms and can be used as an initial step towards assessing the completeness of ontologies. The evaluation yields a recall of 80% at a precision of 61% where the two independent approaches are complementing each other. For MeSH in particular, we show that it can be considered complete in its medical focus area. We integrated the work into DOG4DAG, an ontology generation plugin for the editors OBO-Edit and Protégé, making it the first plugin that supports sibling discovery on-the-fly.</p>
<p>
<bold>Availability:</bold>
Sibling discovery for ontologies is available as part of DOG4DAG (
<ext-link ext-link-type="uri" xlink:href="www.biotec.tu-dresden.de/research/schroeder/dog4dag">www.biotec.tu-dresden.de/research/schroeder/dog4dag</ext-link>
) for both Protégé 4.1 and OBO-Edit 2.1.</p>
<p>
<bold>Contact:</bold>
<email>ms@biotec.tu-dresden.de</email>
;
<email>goetz.fabian@biotec.tu-dresden.de</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary data</ext-link>
are available at Bioinformatics online.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Ashburner, M" uniqKey="Ashburner M">M. Ashburner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Balog, K" uniqKey="Balog K">K. Balog</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bodenreider, O" uniqKey="Bodenreider O">O. Bodenreider</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brunzel, M" uniqKey="Brunzel M">M. Brunzel</name>
</author>
<author>
<name sortKey="Spiliopoulou, M" uniqKey="Spiliopoulou M">M. Spiliopoulou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cote, R G" uniqKey="Cote R">R.G. Côté</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Day Richter, J" uniqKey="Day Richter J">J. Day-Richter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Doms, A" uniqKey="Doms A">A. Doms</name>
</author>
<author>
<name sortKey="Schroeder, M" uniqKey="Schroeder M">M. Schroeder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Etzioni, O" uniqKey="Etzioni O">O. Etzioni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Frantzi, K" uniqKey="Frantzi K">K. Frantzi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hearst, M" uniqKey="Hearst M">M. Hearst</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Howe, D" uniqKey="Howe D">D. Howe</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kozareva, Z" uniqKey="Kozareva Z">Z. Kozareva</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lin, D" uniqKey="Lin D">D. Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, K" uniqKey="Liu K">K. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ogren, P V" uniqKey="Ogren P">P.V. Ogren</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pantel, P" uniqKey="Pantel P">P. Pantel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pa Ca M" uniqKey="Pa Ca M">M. Paşca</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schober, D" uniqKey="Schober D">D. Schober</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shi, S" uniqKey="Shi S">S. Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shi, S" uniqKey="Shi S">S. Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shinzato, K" uniqKey="Shinzato K">K. Shinzato</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="W Chter, T" uniqKey="W Chter T">T. Wächter</name>
</author>
<author>
<name sortKey="Schroeder, M" uniqKey="Schroeder M">M. Schroeder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, R" uniqKey="Wang R">R. Wang</name>
</author>
<author>
<name sortKey="Cohen, W" uniqKey="Cohen W">W. Cohen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Whetzel, P L" uniqKey="Whetzel P">P.L. Whetzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Whetzel, P L" uniqKey="Whetzel P">P.L. Whetzel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yao, L" uniqKey="Yao L">L. Yao</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H. Zhang</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">22689774</article-id>
<article-id pub-id-type="pmc">3371847</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/bts215</article-id>
<article-id pub-id-type="publisher-id">bts215</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>ISMB 2012 Proceedings Papers Committee July 15 to July 19, 2012, Long Beach, CA, USA</subject>
</subj-group>
<subj-group>
<subject>Original Papers</subject>
<subj-group>
<subject>Databases and Ontologies</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Extending ontologies by finding siblings using set expansion techniques</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Fabian</surname>
<given-names>Götz</given-names>
</name>
<xref ref-type="corresp" rid="COR1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wächter</surname>
<given-names>Thomas</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Schroeder</surname>
<given-names>Michael</given-names>
</name>
<xref ref-type="corresp" rid="COR1">
<sup>*</sup>
</xref>
</contrib>
</contrib-group>
<aff id="AFF1">Biotechnology Center (BIOTEC), Technische Universität Dresden, 01062 Dresden, Germany</aff>
<author-notes>
<corresp id="COR1">* To whom correspondence should be addressed.</corresp>
</author-notes>
<pub-date pub-type="ppub">
<day>15</day>
<month>6</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>9</day>
<month>6</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>9</day>
<month>6</month>
<year>2012</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>28</volume>
<issue>12</issue>
<fpage>i292</fpage>
<lpage>i300</lpage>
<permissions>
<copyright-statement>© The Author(s) 2012. Published by Oxford University Press.</copyright-statement>
<copyright-year>2012</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/3.0">http://creativecommons.org/licenses/by-nc/3.0</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation:</bold>
Ontologies are an everyday tool in biomedicine to capture and represent knowledge. However, many ontologies lack a high degree of coverage in their domain and need to improve their overall quality and maturity. Automatically extending sets of existing terms will enable ontology engineers to systematically improve text-based ontologies level by level.</p>
<p>
<bold>Results:</bold>
We developed an approach to extend ontologies by discovering new terms which are in a sibling relationship to existing terms of an ontology. For this purpose, we combined two approaches which retrieve new terms from the web. The first approach extracts siblings by exploiting the structure of HTML documents, whereas the second approach uses text mining techniques to extract siblings from unstructured text. Our evaluation against MeSH (Medical Subject Headings) shows that our method for sibling discovery is able to suggest first-class ontology terms and can be used as an initial step towards assessing the completeness of ontologies. The evaluation yields a recall of 80% at a precision of 61% where the two independent approaches are complementing each other. For MeSH in particular, we show that it can be considered complete in its medical focus area. We integrated the work into DOG4DAG, an ontology generation plugin for the editors OBO-Edit and Protégé, making it the first plugin that supports sibling discovery on-the-fly.</p>
<p>
<bold>Availability:</bold>
Sibling discovery for ontologies is available as part of DOG4DAG (
<ext-link ext-link-type="uri" xlink:href="www.biotec.tu-dresden.de/research/schroeder/dog4dag">www.biotec.tu-dresden.de/research/schroeder/dog4dag</ext-link>
) for both Protégé 4.1 and OBO-Edit 2.1.</p>
<p>
<bold>Contact:</bold>
<email>ms@biotec.tu-dresden.de</email>
;
<email>goetz.fabian@biotec.tu-dresden.de</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary data</ext-link>
are available at Bioinformatics online.</p>
</abstract>
<counts>
<page-count count="9"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="SEC1">
<title>1 INTRODUCTION</title>
<p>During the last decade, the field of biomedicine has seen a data explosion, made evident by the overwhelming number of published articles, databases, nucleotide sequences and protein structures (
<xref ref-type="bibr" rid="B11">Howe
<italic>et al</italic>
., 2008</xref>
). Today, ontologies are used extensively in the biomedical and healthcare sector for information and data integration, such as gene product annotation (
<xref ref-type="bibr" rid="B1">Ashburner
<italic>et al</italic>
., 2000</xref>
), analysis of high-throughput data (
<xref ref-type="bibr" rid="B24">Whetzel
<italic>et al</italic>
., 2006</xref>
) and searching (
<xref ref-type="bibr" rid="B7">Doms and Schroeder, 2005</xref>
). If an ontology cannot maintain a high degree of coverage in its domain, its correctness and integrity will suffer, leading to missing results when trying to find documents or genes semantically associated with terms (
<xref ref-type="bibr" rid="B14">Liu
<italic>et al</italic>
., 2011</xref>
). However, to keep up with new information, ontologies must be revised and newly added terms need to be enriched with definitions, cross-references and additional properties. Since ontologies are manually curated, developing and maintaining them is often a slow, tedious and error-prone process. To mitigate this bottleneck, text mining and related techniques can be employed to enrich ontologies in a semi-automated fashion. Among the variety of ontology learning methods proposed in the past, mainly term recognition and pattern-based relationship extraction methods are used in the biomedical field (
<xref ref-type="bibr" rid="B14">Liu
<italic>et al</italic>
., 2011</xref>
).</p>
<p>In this article, we present an alternative approach to enhancing ontologies by automatically finding suitable co-hyponyms of terms, i.e. finding terms which are in a sibling relationship to each other. This approach can be used to extend ontologies in a horizontal way and therefore to complete a set of terms. For instance, an ontology that already includes the terms
<italic>somatotrophs</italic>
and
<italic>trophoblasts</italic>
(which are both
<italic>endocrine cells</italic>
) could be extended by automatically proposing more terms with the same parent term (
<xref ref-type="fig" rid="F1">Fig. 1</xref>
). With this approach, ontology engineers can semi-automatically extend ontologies using two to three terms, which are the ‘seed terms’ for the algorithm. Many existing ontologies can be expanded in this way with minimal effort.
<fig id="F1" position="float">
<label>Fig. 1.</label>
<caption>
<p>Sibling discovery example:
<italic>Somatotrophs</italic>
,
<italic>Trophoblasts</italic>
, and
<italic>Thyrotrophs</italic>
are known child terms of
<italic>Endocrine Cells</italic>
in MeSH; other children, such as
<italic>Gonadotrophs</italic>
and
<italic>Lactotrophs</italic>
, can be found automatically using our approach.</p>
</caption>
<graphic xlink:href="bts215f1"></graphic>
</fig>
</p>
<p>Our method extracts siblings of existing terms on-the-fly from web sites returned by search engine queries, thus implicitly incorporating full-text journal articles, patents, textbooks, wiki pages, etc. as indexed by the engines. We integrate two approaches which, when combined, improve the quality of the proposed siblings in terms of precision and recall.</p>
<p>
<italic>Structure-based approach</italic>
: The first approach extracts siblings from the structure of web sites. It is based on the observation that terms that are in a sibling relationship to each other are often located together in tables, lists or headings. If seed terms are found in such elements, the remaining content of those elements has a high probability of being semantically related to the seed terms. For instance,
<xref ref-type="fig" rid="F2">Figure 2</xref>
shows an excerpt of the Wikipedia page on the endocrine system. When given
<italic>Somatotrophs</italic>
and
<italic>Gonadotrophs</italic>
(both endocrine cells) as seed terms, a third (possible) endocrine cell,
<italic>Corticotrophs</italic>
, can be extracted. We do this by exploiting the structure of HTML documents, which are prevalent on the web.
<fig id="F2" position="float">
<label>Fig. 2.</label>
<caption>
<p>Excerpt of HTML code from the Wikipedia page on the endocrine system [
<ext-link ext-link-type="uri" xlink:href="http://en.wikipedia.org/wiki/Endocrine_system">http://en.wikipedia.org/wiki/Endocrine_system</ext-link>
]</p>
</caption>
<graphic xlink:href="bts215f2"></graphic>
</fig>
</p>
<p>
<italic>Text-based approach</italic>
: The second approach finds candidate siblings from text by extracting them from enumerations in sentences. For instance, the following sentence contains an enumeration of endocrine cells: ‘…
<italic>several adenohypophysial endocrine cells such as somatotrophs</italic>
,
<italic>thyrotrophs</italic>
,
<italic>and gonadotrophs</italic>
’ [
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pubmed/11478270">http://www.ncbi.nlm.nih.gov/pubmed/11478270</ext-link>
]. In this sentence, the enumerated terms are
<italic>somatotrophs</italic>
,
<italic>thyrotrophs</italic>
and
<italic>gonadotrophs</italic>
, which are all
<italic>adenohypophysial endocrine cells</italic>
. These enumerations occur in many forms but often follow recurring patterns, which we can exploit. We extract terms with high accuracy using morpho-syntactic pattern matching: we analyze the pattern of the sentence and its enumerated terms and then extract the terms. Further examples of extracted enumerations from sentences can be found in
<xref ref-type="table" rid="T1">Table 1</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S1</ext-link>
.
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<p>Examples of parsed website results (selected from the top 10 websites)</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Topic</th>
<th rowspan="1" colspan="1">Seed terms (from MeSH)</th>
<th rowspan="1" colspan="1">Extracted snippet</th>
<th rowspan="1" colspan="1">Discovered terms</th>
<th rowspan="1" colspan="1">Website</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Particles</td>
<td rowspan="1" colspan="1">Heavy Ions, Neutrons, Protons</td>
<td rowspan="1" colspan="1">A particle, such as an
<bold>electron</bold>
,
<bold>proton</bold>
, or
<bold>neutron</bold>
, having…</td>
<td rowspan="1" colspan="1">Electron</td>
<td rowspan="1" colspan="1">answers.com</td>
</tr>
<tr>
<td rowspan="1" colspan="1">GnRH</td>
<td rowspan="1" colspan="1">Goserelin, Nafarelin, Buserelin</td>
<td rowspan="1" colspan="1">GnRH agonist analogues such as
<bold>buserelin</bold>
,
<bold>goserelin</bold>
,
<bold>lupron</bold>
, and
<bold>decapeptyl</bold>
inhibit the action…</td>
<td rowspan="1" colspan="1">Lupron, decapeptyl</td>
<td rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="ncbi.nlm.nih.gov">ncbi.nlm.nih.gov</ext-link>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Berberideae species</td>
<td rowspan="1" colspan="1">Mahonia, Caulophyllum, Epimedium</td>
<td rowspan="1" colspan="1">…only included four genera (
<bold>Berberis</bold>
,
<bold>Epimedium</bold>
,
<bold>Mahonia</bold>
,
<bold>Vancouveria</bold>
), with the other…</td>
<td rowspan="1" colspan="1">Berberis, Vancouveria</td>
<td rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="righthealth.com">righthealth.com</ext-link>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Bacillus species</td>
<td rowspan="1" colspan="1">Bacillus cereus, Bacillus megaterium, Bacillus subtilis</td>
<td rowspan="1" colspan="1">Microorganisms of the Bacillus species include
<bold>Bacillus cereus</bold>
,
<bold>Bacillus mycoides</bold>
,
<bold>Bacillus subtilis</bold>
,
<bold>Bacillus anthracis</bold>
, and
<bold>Bacillus thuringiensis</bold>
.</td>
<td rowspan="1" colspan="1">Bacillus mycoides, Bacillus anthracis, Bacillus thuringiensis</td>
<td rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="freepatentsonline.com">freepatentsonline.com</ext-link>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">European countries</td>
<td rowspan="1" colspan="1">Netherlands, Finland, Austria</td>
<td rowspan="1" colspan="1">…Shakira's most successful song in Europe, where it topped many of the medium sized charts, including
<bold>Austria</bold>
,
<bold>Denmark</bold>
,
<bold>Finland</bold>
,
<bold>Norway</bold>
and
<bold>Sweden</bold>
.</td>
<td rowspan="1" colspan="1">Denmark, Norway, Sweden</td>
<td rowspan="1" colspan="1">
<ext-link ext-link-type="uri" xlink:href="en.wikipedia.org">en.wikipedia.org</ext-link>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Seed terms and discovered terms are printed in bold in the extracted snippet.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>Finally, the results of both approaches are combined to obtain a single ranked list of terms.</p>
<p>Generating siblings from seed terms is done in an interactive manner and only takes a few seconds. We optimized our method for biomedical ontologies by adapting both approaches to the peculiarities of biomedical terminology. Nonetheless, the method is suitable for ontologies of other domains. Since this work has been integrated as part of the DOG4DAG plugin (
<xref ref-type="bibr" rid="B22">Wächter and Schroeder, 2010</xref>
) into OBO-Edit and now also into Protégé, ontologies in both the OBO and the OWL format can be enriched seamlessly using sibling discovery.</p>
<p>The rest of this article is organized as follows. First, we compare our method to previous work in sibling generation. Next, we describe the approach and evaluate it using MeSH (Medical Subject Headings), a widely used ontology, as well as the Text REtrieval Conference Entity List Completion (TREC ELC) task. Furthermore, we show how the DOG4DAG plugin can be leveraged to extend ontologies using sibling generation. Finally, we discuss our approach and the results and propose future work.</p>
</sec>
<sec id="SEC2">
<title>2 RELATED WORK</title>
<p>The domain of ontology learning, including many approaches employing the web as a corpus, is a field of intensive research. Sibling generation using set expansion has been discussed in a number of studies which include approaches exploiting textual patterns, the HTML structure of web pages and distributional similarity (DS) of terms.</p>
<p>A number of text-based approaches incorporate
<xref ref-type="bibr" rid="B10">Hearst patterns (1992</xref>
) to find parent-child relationships in free text using lexico-syntactic pattern matching. Sibling generation is included in KnowItAll (
<xref ref-type="bibr" rid="B8">Etzioni
<italic>et al</italic>
., 2005</xref>
), a generic information extraction engine for unsupervised named-entity extraction. Using search results from the web, facts, terms and relations are extracted using bootstrapped patterns.
<xref ref-type="bibr" rid="B19">Shi
<italic>et al</italic>
. (2008</xref>
) and
<xref ref-type="bibr" rid="B27">Zhang
<italic>et al</italic>
. (2009</xref>
) also find siblings with sentence patterns and predefined HTML tag patterns. However, our system works on arbitrary tags and is not restricted to specific tags for lists or tables.
<xref ref-type="bibr" rid="B17">Paşca (2004</xref>
) retrieves siblings from the web in an unsupervised manner using pattern learning and part-of-speech (POS) and noun phrase (NP) tagging. Candidate siblings are ranked based on co-occurrence frequency. Also,
<xref ref-type="bibr" rid="B12">Kozareva
<italic>et al</italic>
. (2008</xref>
) built a pattern-based system for learning specific semantic classes (e.g. countries or singers). Contrary to our approach, they only used one highly specific surface pattern and did not incorporate NP chunking to correctly separate NPs from each other.</p>
<p>Several systems have also been developed for a structure-based approach. The systems SEAL (
<xref ref-type="bibr" rid="B23">Wang and Cohen, 2007</xref>
) and XTREEM (
<xref ref-type="bibr" rid="B4">Brunzel and Spiliopoulou, 2006</xref>
) both exploit semi-structured HTML documents to expand sets using a number of given seed terms. Wang and Cohen presented SEAL, a system which expands seeds by querying search engines and automatically inducing wrappers for each web page. In XTREEM, semantic sibling associations are extracted from web pages by grouping paths in DOM trees which include seed terms. However, their system does not return a ranked list of candidate siblings, but rather sets of sibling clusters. KnowItAll was also extended to include a ‘List Extractor’ component which extracts facts by exploiting the HTML structure of web pages.
<xref ref-type="bibr" rid="B21">Shinzato
<italic>et al</italic>
. (2004</xref>
) also extract siblings from HTML documents and rank the candidate siblings using cosine similarity.</p>
<p>Other approaches exploit DS to find siblings in text by looking at the context of each term. For instance,
<xref ref-type="bibr" rid="B13">Lin
<italic>et al</italic>
. (2001</xref>
) generate sibling sets with an unsupervised algorithm on a newspaper corpus and on MEDLINE abstracts. Similarly,
<xref ref-type="bibr" rid="B16">Pantel
<italic>et al</italic>
. (2009</xref>
) expand sets of terms by DS in a semi-supervised approach using seed items for each set. In general, DS approaches yield lower performance than pattern-based approaches when extracting proper nouns (
<xref ref-type="bibr" rid="B20">Shi
<italic>et al</italic>
., 2010</xref>
).</p>
<p>None of these set expansion methods has effectively combined both approaches for on-the-fly sibling discovery. Furthermore, the presented systems usually do not have any background knowledge in the form of an ontology and take only a number of seed terms as input. In contrast, our method also takes the parent term and lexical variants such as synonyms and abbreviations into account. Additionally, we optimized our method for the peculiarities of biomedical terminology and ontologies.</p>
<p>In terms of integrating ontology learning tools into editors such as OBO-Edit (
<xref ref-type="bibr" rid="B6">Day-Richter
<italic>et al</italic>
., 2007</xref>
) or Protégé [
<ext-link ext-link-type="uri" xlink:href="http://protege.stanford.edu">http://protege.stanford.edu</ext-link>
], two plugins currently exist: DOG4DAG (
<xref ref-type="bibr" rid="B22">Wächter and Schroeder, 2010</xref>
) and TerMine (
<xref ref-type="bibr" rid="B9">Frantzi
<italic>et al</italic>
., 2000</xref>
). We extended our plugin DOG4DAG, making it the first integrated tool to support sibling generation.</p>
</sec>
<sec id="SEC3">
<title>3 METHODS</title>
<p>In this section, we present our 2-fold approach to sibling generation from a given set of seed terms using the web as a corpus. The whole pipeline is summarized in
<xref ref-type="fig" rid="F3">Figure 3</xref>
.
<fig id="F3" position="float">
<label>Fig. 3.</label>
<caption>
<p>Overview of the sibling generation pipeline. Using seed terms, candidate siblings are generated, which are then aggregated into a final candidate sibling list</p>
</caption>
<graphic xlink:href="bts215f3"></graphic>
</fig>
</p>
<p>To complete an existing set of terms with an identical parent, a subset of these terms is selected as seed terms. The parent term and the already existing siblings are also added to the input. In addition to the label of each term, its lexical variants (synonyms, abbreviations, etc.), if available, are included in the query and used for ranking. Using these seed terms, we query search engines and use the results to retrieve candidate siblings.</p>
<sec id="SEC3.1">
<title>3.1 Structure-based approach</title>
<p>In the first approach, search engines are queried for web sites containing the seed terms. After downloading the web pages, their parse trees are generated, and candidate siblings are extracted by finding paths identical to those of the seed terms.</p>
<p>
<italic>Query engines</italic>
: Query search engines (currently Yahoo! and Bing/MSN Live) with the selected seed terms and retrieve the search results. The queries are constructed by concatenating seed terms (in quotation marks) by the
<monospace>AND</monospace>
operator, thus ensuring that both terms occur in the web page. If more than two seed terms are used, query all pairwise combinations. At present, both search engines return 50 search results for every query.</p>
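<p>The pairwise query construction described above can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, and the exact query syntax accepted by each search engine is an assumption rather than something specified here.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def build_pairwise_queries(seed_terms):
    """Return one AND query for every unordered pair of seed terms."""
    queries = []
    for first, second in combinations(seed_terms, 2):
        # Quote each term so that multi-word terms are matched as phrases.
        queries.append(f'"{first}" AND "{second}"')
    return queries


if __name__ == "__main__":
    for query in build_pairwise_queries(["Somatotrophs", "Trophoblasts", "Thyrotrophs"]):
        print(query)
```
]]></preformat>
<p>With three seed terms, this yields three queries, one per pair.</p>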
<p>
<italic>Download pages</italic>
: Download the web pages of the search results, parse and convert them to valid XHTML documents using HTMLCleaner [
<ext-link ext-link-type="uri" xlink:href="http://htmlcleaner.sourceforge.net/">http://htmlcleaner.sourceforge.net/</ext-link>
]. This step is required because many web pages contain invalid syntax, e.g. missing closing tags. The result of this cleaning process is a parse tree. We can represent the structure of a web page in such a parse tree, whose nodes contain the tags (such as
<monospace>&lt;
<italic>td</italic>
&gt;</monospace>
) and their textual contents.</p>
<p>
<italic>Traverse parse tree</italic>
: Traverse the parse tree in a depth-first search and find nodes whose text contains a seed term. Build the paths from the root node to the nodes containing the term (e.g.
<monospace>
</monospace>) if the seed term was found inside a table data cell). Extract the terms from nodes which have the same parent node inside the parse tree and the identical HTML tag as the seed term node.</p>
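<p>The traversal step can be illustrated with a short sketch. It assumes the page has already been cleaned into well-formed XHTML and uses Python's standard xml.etree.ElementTree module instead of HTMLCleaner. As a simplification, a node counts as a seed node only if its full text equals a seed term, whereas the method described above matches nodes whose text contains a seed term. The function name is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET


def find_structural_siblings(xhtml, seed_terms):
    """Collect texts of nodes sharing parent and tag with a seed-term node."""
    root = ET.fromstring(xhtml)
    seeds = {term.lower() for term in seed_terms}
    candidates = []
    # iter() visits every element depth-first, in document order.
    for parent in root.iter():
        children = list(parent)
        texts = ["".join(child.itertext()).strip() for child in children]
        # Tags of the children whose text is exactly a seed term.
        seed_tags = {
            child.tag
            for child, text in zip(children, texts)
            if text.lower() in seeds
        }
        for child, text in zip(children, texts):
            is_seed = text.lower() in seeds
            if child.tag in seed_tags and text and not is_seed and text not in candidates:
                candidates.append(text)
    return candidates


if __name__ == "__main__":
    page = (
        "<html><body><ul>"
        "<li>Somatotrophs</li><li>Gonadotrophs</li><li>Corticotrophs</li>"
        "</ul><p>Unrelated text</p></body></html>"
    )
    print(find_structural_siblings(page, ["Somatotrophs", "Gonadotrophs"]))
```
]]></preformat>
<p>In this toy page, only the list item that shares its parent and tag with the seed nodes is returned; the unrelated paragraph is ignored because no seed term occurs among its siblings.</p>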
<p>
<italic>Group candidates</italic>
: Group all extracted candidate siblings into a candidate sibling set. If the seed term is preceded or followed by a string (such as ‘function of
<italic>seed term</italic>
’) in the text, all candidate siblings are also required to include this string.</p>
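<p>A minimal sketch of the traversal and extraction steps, assuming the page has already been cleaned into well-formed XHTML. It uses Python's standard <monospace>xml.etree.ElementTree</monospace> instead of HTMLCleaner. The function name <monospace>sibling_candidates</monospace> and the sample page are hypothetical, and the additional requirement on surrounding strings (such as ‘function of’) is not covered.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, already cleaned page used only for illustration.
PAGE = (
    "<html><body><ul><li>Finland</li><li>Sweden</li><li>Norway</li></ul>"
    "<p>Finland</p></body></html>"
)


def sibling_candidates(xhtml, seed_term):
    """Collect the text of nodes sharing parent and tag with a seed-term node."""
    root = ET.fromstring(xhtml)  # requires well-formed XHTML
    seed = seed_term.lower()
    candidates = []
    # Element.iter() visits nodes depth-first in document order.
    for parent in root.iter():
        children = list(parent)
        for node in children:
            # Only the node's own text is checked, not that of its descendants.
            if seed not in (node.text or "").lower():
                continue
            for other in children:
                text = (other.text or "").strip()
                if other is not node and other.tag == node.tag and text:
                    candidates.append(text)
    return candidates


print(sibling_candidates(PAGE, "Finland"))
```

<p>In this sketch, the list items next to the seed term are returned, whereas the isolated paragraph containing the same seed term contributes nothing because it has no sibling with the same tag.</p>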
</sec>
<sec id="SEC3.2">
<title>3.2 Text-based approach</title>
<p>The second approach uses textual patterns to extract candidate siblings from enumerations in text. By querying search engines, we retrieve text snippets on-the-fly, from which siblings are extracted, filtered and ranked.</p>
<p>
<italic>Pattern extraction and expansion</italic>
: We built a small, manually annotated corpus containing sentences with typical enumerations. Whenever a sentence is added to the corpus, the annotated sentences are preprocessed automatically. From each sentence, head terms, enumeration items and words in between are extracted. To form the basis for patterns, head terms and enumerations are replaced with placeholders and the surrounding text is removed. These sentences are expanded and altered to allow for more variation. For instance, commas are added after introductory phrases and conjunctions are changed (e.g. ‘and’ is replaced with ‘or’). From these patterns, regular expressions are created automatically, which are used to match sentences and extract enumeration items and the head term. To add a new type of enumeration, one can simply add the new sentence to the corpus, which in turn leads to new generalized patterns. The generated regular expressions are stored on disk and are loaded for sibling generation.</p>
<p>
<italic>Web search</italic>
: Like the structure-based approach, web search is used to retrieve snippets. In the queries, the introductory phrase is included to find relevant results. Additionally, the
<monospace>NEAR</monospace>
operator of the search engine is used to force the seed terms to appear close to each other, which in most cases means in the same sentence. We do not retrieve the whole web page, but instead use the snippets (usually 300 characters long) that the search engines provide, which contain the search terms and thus the enumeration. Again, pairwise combinations are used for the web queries.</p>
<p>
<italic>Text processing and enumeration extraction</italic>
: The retrieved snippets are first tokenized and then processed using sentence, POS, and NP tagging. For POS tagging, the LingPipe Tagger [
<ext-link ext-link-type="uri" xlink:href="http://alias-i.com/lingpipe/">http://alias-i.com/lingpipe/</ext-link>
] trained on the MEDLINE corpus is used. Phrases of the pattern
<monospace>[adj|verb]*[fill]{2}[noun]+</monospace>
are regarded as NPs (
<monospace>fill</monospace>
are words like ‘of’, ‘the’, ‘for’, etc.). Furthermore, abbreviations are extracted by checking if a candidate term contains a short form after the long form in brackets. If both forms match, the short form (i.e. the abbreviation) is grouped with the long form. The sentences are now matched against the regular expressions of the pipeline (‘morphosyntactic matching’). If a sentence contains multiple enumerations, all enumerations are extracted separately. To find as many enumerations as possible, three sets of regular expressions are used for matching and finding enumerations.</p>
<p>The first set consists of regular expressions including the head term, the introductory phrase, the enumeration items and a conjunction to separate the last two items (this conjunction does not exist if the phrase is located at the end). The search results are matched against the regular expressions. If a match occurs, the enumerated items are extracted and subsequently analyzed. If a seed term occurs among the extracted items, the remaining items become a candidate sibling set. If a snippet does not match a regular expression of this set, the next set is used. It matches all sentences which include the head term, introductory phrase and enumeration items. The last set matches all enumerations which include a conjunction at the end, but do not have an introductory phrase. Note that each set is more generic than the previous one.</p>
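<p>To illustrate the first, most specific kind of expression (head term, introductory phrase, items and a final conjunction), the following sketch uses a single hand-written pattern for the introductory phrase ‘such as’. The pattern and the function names are assumptions made for illustration; the system generates its expressions automatically from the annotated corpus. The last item is limited to a single word here, because its end cannot be determined reliably without NP tagging.</p>

```python
import re

# Hand-written stand-in for one automatically generated expression.
PATTERN = re.compile(
    r"(?P<head>[\w-]+),? such as (?P<items>.+?),? (?:and|or) (?P<last>[\w-]+)"
)


def extract_enumeration(sentence):
    """Return (head term, enumerated items), or None if nothing matches."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    items = [item.strip() for item in match.group("items").split(",")]
    items.append(match.group("last"))
    return match.group("head"), [item for item in items if item]


def candidate_set(sentence, seed_terms):
    """Keep the non-seed items only if a seed term is among the items."""
    result = extract_enumeration(sentence)
    if result is None:
        return None
    head, items = result
    seeds = {seed.lower() for seed in seed_terms}
    if not any(item.lower() in seeds for item in items):
        return None
    return head, [item for item in items if item.lower() not in seeds]


SENTENCE = "Lipids such as waxes, sterols and phospholipids are hydrophobic."
print(candidate_set(SENTENCE, ["sterols"]))
```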
<p>Finally, all extracted terms are validated by checking whether they are NPs (‘linguistic revision’). This is especially important for the last phrase (after the conjunction), where the end of the phrase cannot be determined reliably without the NP tagger.</p>
<p>In addition, enumerations in sentences without any introductory phrases, conjunctions or head terms are also retrieved. For this, the search engines are queried using only the seed terms concatenated by the
<monospace>NEAR</monospace>
operator. The separators between the phrases are automatically recognized and all enumerated items are subsequently extracted.</p>
</sec>
<sec id="SEC3.3">
<title>3.3 Syntactic filtering</title>
<p>To improve the accuracy of the extracted candidate siblings, a number of syntactic filter steps are used. We set a minimum length of 3 and a maximum length of 50 characters for each generated candidate sibling. By using a minimum length of 3 characters, gene and protein family names like ‘p53’ or ‘WNT’ are still regarded as valid items. By limiting the length to 50 characters, we can exclude spurious NPs. In addition, we use a stop-word list to remove unnecessary words like ‘other’, ‘many more’ or ‘etc’ from the extracted siblings. Finally, all candidate sibling sets containing fewer than three terms (including the seed terms) are dropped.</p>
<p>Duplicated candidate sibling sets (with the same siblings) are automatically discarded, since they are most likely retrieved from identical web pages.</p>
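<p>The filter steps of this section can be sketched as below. The thresholds are those given in the text. The function name <monospace>filter_sets</monospace>, the three-entry stop-word list and the decision to drop a whole term when it equals a stop word are simplifying assumptions.</p>

```python
# Illustrative excerpt; the full stop-word list is not given in the text.
STOP_WORDS = {"other", "many more", "etc"}


def filter_sets(candidate_sets, min_len=3, max_len=50, min_terms=3):
    """Apply length, stop-word, set-size and duplicate filters."""
    seen = set()
    kept = []
    for terms in candidate_sets:
        cleaned = [
            term for term in terms
            if min_len <= len(term) <= max_len and term.lower() not in STOP_WORDS
        ]
        # Sets with the same members are treated as duplicates.
        key = frozenset(term.lower() for term in cleaned)
        if len(cleaned) < min_terms or key in seen:
            continue
        seen.add(key)
        kept.append(cleaned)
    return kept


SETS = [["p53", "WNT", "etc", "MYC"], ["WNT", "p53", "MYC"], ["p53", "x"]]
print(filter_sets(SETS))
```

<p>Here the second set is discarded as a duplicate of the first, and the third set is dropped because fewer than three terms remain after filtering.</p>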
</sec>
<sec id="SEC3.4">
<title>3.4 Ranking</title>
<p>After retrieving all relevant web pages and extracting the candidate siblings, the siblings from the structure-based and text-based approaches need to be ranked and then aggregated into a single ranked list. For ranking the individual candidate sibling sets, we use a straightforward co-occurrence scheme: candidate siblings are ranked higher if they co-occur with more seed terms. Each candidate sibling
<italic>s</italic>
in a candidate sibling set containing
<italic>k</italic>
seed terms (
<italic>k</italic>
> 0) is given a score as follows:
<disp-formula>
<graphic xlink:href="bts215um1"></graphic>
</disp-formula>
</p>
<p>Thus, our system rewards sibling sets that contain a larger number of seed terms and gives a low score (0.1) to sibling sets with only one seed term, since this co-occurrence may be coincidental. If a candidate sibling occurs in multiple candidate sibling sets, the scores are added up to yield a score for each candidate sibling. If a lexical variant such as a synonym or abbreviation of a seed term occurs in the candidate sibling set, it is also counted as a seed term.</p>
<p>In addition, the retrieved candidate siblings are re-ranked by the following measures:
<list list-type="bullet">
<list-item>
<p>
<italic>Hypernym matching</italic>
: For the text-based approach, we identify the head term of the enumeration. If this term matches the hypernym of the seed terms (i.e. their parent term), it is preferred. Since the head term in the text almost always occurs in plural, we stem the extracted head terms first.</p>
</list-item>
<list-item>
<p>
<italic>Compound term matching</italic>
[as proposed by
<xref ref-type="bibr" rid="B15">Ogren
<italic>et al</italic>
. (2004</xref>
)]: Biomedical terminology often consists of multiple compound terms, e.g. subterms of the MeSH term
<italic>Stem Cells</italic>
include
<italic>Adult Stem Cells</italic>
,
<italic>Hematopoietic Stem Cells</italic>
and
<italic>Mesenchymal Stem Cells</italic>
. We prefer candidate siblings that have the parent term as a suffix.</p>
</list-item>
</list>
</p>
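<p>Compound term matching can be illustrated with a word-level suffix check. This is a sketch only: the function names are hypothetical, and how strongly the system boosts matching candidates is not specified in the text, so the example simply moves them to the front while keeping the original order otherwise.</p>

```python
def has_parent_suffix(candidate, parent_term):
    """True if the candidate ends with the words of the parent term."""
    cand_words = candidate.lower().split()
    parent_words = parent_term.lower().split()
    return (
        len(cand_words) > len(parent_words)
        and cand_words[-len(parent_words):] == parent_words
    )


def prefer_compounds(candidates, parent_term):
    """Stable re-ranking: compound terms of the parent come first."""
    return sorted(candidates, key=lambda c: not has_parent_suffix(c, parent_term))


RANKED = ["Neurons", "Adult Stem Cells", "Fibroblasts", "Mesenchymal Stem Cells"]
print(prefer_compounds(RANKED, "Stem Cells"))
```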
<p>
<italic>Combining the ranked lists of both methods</italic>
: We examined several methods for aggregating the ranked lists of the two approaches. In our evaluation, summing the normalized scores of candidate siblings with identical labels and then sorting the merged list by these scores yielded the best results. Previously, other ranking methods have been evaluated (
<xref ref-type="bibr" rid="B23">Wang and Cohen, 2007</xref>
,
<xref ref-type="bibr" rid="B4">Brunzel and Spiliopoulou, 2006</xref>
) and shown to have no significant impact on the overall results.</p>
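<p>The aggregation that performed best can be sketched as follows. The text does not state how scores are normalized, so dividing by the highest score in each list is an assumption, as are the function names and the example scores.</p>

```python
def normalize(scores):
    """Scale scores to [0, 1] by the list maximum (assumed normalization)."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {label: 0.0 for label in scores}
    return {label: score / top for label, score in scores.items()}


def aggregate(structure_scores, text_scores):
    """Sum normalized scores of identical labels and sort in descending order."""
    combined = {}
    for scores in (normalize(structure_scores), normalize(text_scores)):
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + score
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from the two approaches.
ranked = aggregate({"Belgium": 4.0, "France": 2.0}, {"France": 1.0, "Italy": 0.5})
print(ranked)
```

<p>A candidate found by both approaches (France in this example) accumulates score from both lists and can therefore overtake a candidate that tops only one of them.</p>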
</sec>
<sec id="SEC3.5">
<title>3.5 Evaluation</title>
<p>To evaluate our method, we used the 2011 MeSH [
<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/mesh">http://www.nlm.nih.gov/mesh</ext-link>
]. For this purpose, we randomly sampled 1000 parent terms from MeSH and chose three random child terms of each as seed terms. We required each parent term to have at least five child terms, so that the system can potentially find at least two siblings when three of the child terms are used as seed terms. Additionally, child terms consisting of more than two words were not used as seed terms, since they rarely occur in free text. Terms with artificial descriptor names (e.g. ‘
<italic>Surgical Procedures, Operative</italic>
’) were cleaned up. The 16 top-level categories were also ignored, since the terms are not semantically related, but are rather categorical. We batch-processed all 1000 term sets automatically using the implemented system.</p>
<p>We selected MeSH because it is a thesaurus with a broad coverage (it comprises 26 142 terms and 203 554 lexical variants). Although most of its terms are from the biomedical domain, it also contains terms from other domains, for example in the top-level categories
<italic>Geographicals</italic>
or
<italic>Humanities</italic>
.</p>
</sec>
</sec>
<sec id="SEC4">
<title>4 RESULTS</title>
<sec id="SEC4.1">
<title>4.1 MeSH evaluation</title>
<p>
<italic>Recall</italic>
: First, we evaluated how many of the remaining siblings were found. Of the 7922 siblings which were contained in the sets (not counting the seed terms), 6284 (79.3%) were discovered when using both approaches. For 601 of the 1000 selected sets, all siblings were discovered (
<xref ref-type="fig" rid="F4">Fig. 4</xref>
). Hence, our approach can find most of the existing siblings of the selected sibling sets. The results from both approaches have an overlap of 72.5%. Of the correct results, 35.0% were contributed exclusively by the structure-based approach, and 22.6% were contributed exclusively by the text-based approach. This shows that our idea of combining both approaches is reasonable and improves the overall results.
<fig id="F4" position="float">
<label>Fig. 4.</label>
<caption>
<p>Distribution of the recall of discovered siblings in all results using both approaches. For 601 of the 1000 sets, all siblings were found, resulting in an average recall of 79.3% overall</p>
</caption>
<graphic xlink:href="bts215f4"></graphic>
</fig>
</p>
<p>
<italic>Precision</italic>
: Furthermore, we investigated the precision of the generated siblings to determine the fraction of siblings that are relevant with regard to the existing siblings. Precision is defined as
<disp-formula>
<graphic xlink:href="bts215um2"></graphic>
</disp-formula>
</p>
<p>We used a cut-off rank of 10 when evaluating precision (if the sibling set contains fewer than 10 siblings, we used the number of siblings in the set). Over the 1000 selected sets, the average precision using the structure-based approach is 53.0%, whereas the average precision for the text-based approach is 48.0%. When combining the two approaches using rank aggregation, the precision is 60.8%. The results show that rank aggregation improves precision when compared with the single approaches (
<xref ref-type="fig" rid="F5">Fig. 5</xref>
).
<fig id="F5" position="float">
<label>Fig. 5.</label>
<caption>
<p>Distribution of the precision of the siblings found from the 1000 randomly selected MeSH terms in the Top 10. Half of the tested sets were automatically extended with a precision of >75%. The structure-based approach has shown a higher precision than the text-based approach</p>
</caption>
<graphic xlink:href="bts215f5"></graphic>
</fig>
</p>
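<p>The two measures can be sketched as below, assuming that precision is the fraction of correct terms among the top-ranked results, with the cut-off reduced to the size of the sibling set when it holds fewer than 10 siblings. The function names and the example data are hypothetical.</p>

```python
def recall(ranked, true_siblings):
    """Fraction of the known siblings that appear anywhere in the results."""
    truth = {term.lower() for term in true_siblings}
    if not truth:
        return 0.0
    found = {term.lower() for term in ranked} & truth
    return len(found) / len(truth)


def precision_at_cutoff(ranked, true_siblings, cutoff=10):
    """Precision within the top k results, k = min(cutoff, number of siblings)."""
    truth = {term.lower() for term in true_siblings}
    k = min(cutoff, len(truth))
    if k == 0:
        return 0.0
    hits = sum(1 for term in ranked[:k] if term.lower() in truth)
    return hits / k


RESULTS = ["Waxes", "Pollution", "Sterols", "Oils"]
TRUTH = ["Waxes", "Sterols", "Oils"]
print(recall(RESULTS, TRUTH), precision_at_cutoff(RESULTS, TRUTH))
```

<p>In this example all three siblings are found, but only two of them fall within the top three positions, so precision is lower than recall.</p>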
<p>Since recall and precision do not take the ranking of the generated siblings into account, we also examined the percentage of correct siblings for each position within the top 10 (
<xref ref-type="fig" rid="F6">Fig. 6</xref>
). Overall, the structure-based approach yields better results than the text-based approach. Combining the results improves the performance compared with the single approaches. In the highest ranked result, 76.6% are correct using rank aggregation, whereas 64.7% and 66.4% are correct for the structure-based and text-based approach, respectively. Thus, generated siblings which are true siblings in MeSH are ranked higher than siblings not belonging to the sibling sets of our MeSH evaluation. If the descendants of the siblings are also taken into account, the percentage of correctly generated siblings increases to 81.3% for the top ranked result (70.8% and 69.5% for the structure-based and text-based approach, respectively).
<fig id="F6" position="float">
<label>Fig. 6.</label>
<caption>
<p>Percentage of correctly generated siblings in the top 10. In general, ontologies can be extended using the top generated terms with high confidence. For the top 1 terms, 76.6% are generated correctly</p>
</caption>
<graphic xlink:href="bts215f6"></graphic>
</fig>
</p>
<p>
<xref ref-type="table" rid="T2">Table 2</xref>
shows three examples of generated siblings with varying recall and precision.
<italic>Lipids</italic>
has an overall recall of 100% and a precision of 90%.
<italic>Europe</italic>
contains 22 siblings, of which 15 were found (68% recall). However, it has a precision of only 70% within the top 10. On closer inspection, all of the top 10 terms are correct, although some are actually child terms of a sibling. An example of a term for which only 4 of 24 siblings were found (16.7% recall) is
<italic>Environment</italic>
, which contains many generic terms (e.g.
<italic>Confined Spaces</italic>
or
<italic>Ecosystem</italic>
).
<table-wrap id="T2" position="float">
<label>Table 2.</label>
<caption>
<p>Top 10 results of selected examples</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Parent term</th>
<th colspan="2" align="center" rowspan="1">Lipids</th>
<th colspan="2" align="center" rowspan="1">Europe</th>
<th colspan="2" align="center" rowspan="1">Environment</th>
</tr>
<tr>
<th rowspan="1" colspan="1">Seed terms</th>
<th colspan="2" rowspan="1">Sphingolipids, Lipoproteins, Lipopeptides</th>
<th colspan="2" rowspan="1">Netherlands, Finland, Austria</th>
<th colspan="2" rowspan="1">Fires, Greenhouse Effect, Water Movements</th>
</tr>
<tr>
<th rowspan="1" colspan="1">Rank</th>
<th rowspan="1" colspan="1">Generated term</th>
<th rowspan="1" colspan="1">Relation</th>
<th rowspan="1" colspan="1">Generated term</th>
<th rowspan="1" colspan="1">Relation</th>
<th rowspan="1" colspan="1">Generated term</th>
<th rowspan="1" colspan="1">Relation</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">
<bold>Waxes</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>Belgium</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Pollution</td>
<td rowspan="1" colspan="1">Not in MeSH</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">
<bold>Sterols</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Denmark</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Environmental Monitoring</td>
<td rowspan="1" colspan="1">Unrelated</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">
<bold>Lipopolysaccharides</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>France</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>Water Vapour</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">
<bold>Phospholipids</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Sweden</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Mars</td>
<td rowspan="1" colspan="1">Unrelated</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">
<bold>Glycolipids</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Norway</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Deposition</td>
<td rowspan="1" colspan="1">Not in MeSH</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">
<bold>Fatty acids</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>Italy</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Air</td>
<td rowspan="1" colspan="1">Child</td>
</tr>
<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">
<bold>Oils</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>Switzerland</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Oxygen</td>
<td rowspan="1" colspan="1">Unrelated</td>
</tr>
<tr>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">
<bold>Peptidoglycans</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>Spain</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">GDP</td>
<td rowspan="1" colspan="1">Not in MeSH</td>
</tr>
<tr>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">
<bold>Lipofuscin</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">
<bold>Ireland</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Ozone</td>
<td rowspan="1" colspan="1">Unrelated</td>
</tr>
<tr>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">
<bold>Membrane Lipids</bold>
</td>
<td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Hungary</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Clouds</td>
<td rowspan="1" colspan="1">Not in MeSH</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>True siblings in MeSH are printed in bold. The relation of the generated siblings is given relative to the location of the seed terms in MeSH. The relation ‘Unrelated’ means that the generated term is neither a sibling nor a child of a sibling, but occurs elsewhere in MeSH.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>
<italic>Number of seed terms</italic>
: The presented algorithm accepts any number of seed terms as input. However, to attain satisfactory results, between two and four seed terms should be used. If only one seed term is used, the algorithm may find candidate siblings that fit a meaning of the input other than the intended one. If five or more seed terms are used, the execution time becomes too long, since the number of queries grows quadratically because all pairwise combinations of the seed terms are queried. We also evaluated the same 1000 randomly selected sets with one and two seed terms. When using two seed terms, recall decreases from 79.3% to 68.2% and precision from 60.8% to 51.5% (compared with three seed terms). When only one seed term is used, recall and precision drop to 25.2% and 15.3%, respectively. This shows that even two seed terms produce satisfactory results, which can still support ontology engineers in ontology extension.</p>
<p>
<italic>Overfitting</italic>
: One issue we had to deal with is websites that list MeSH terms and therefore yield very high precision and recall. However, we decided to retain these websites in the results, since they occur rarely and are difficult to filter out reliably.</p>
<p>
<italic>Siblings in abstracts and full-text articles</italic>
: We evaluated the number of siblings extracted from snippets found in MEDLINE abstracts and PubMed Central full-text articles, if one of them was part of the web search results. While only 59 abstracts contained siblings in our evaluation, 117 full-text articles contained enumerations which were used to generate siblings. This is especially noteworthy since MEDLINE contains 19 million abstracts [
<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/pubs/factsheets/medline.html">http://www.nlm.nih.gov/pubs/factsheets/medline.html</ext-link>
], whereas PubMed Central only contains 2.3 million full-text articles. This shows that for text mining in biomedical literature, full-text articles, patent information and website contents should always be taken into consideration, and can sometimes even be a more informative resource than just MEDLINE abstracts.</p>
</sec>
<sec id="SEC4.2">
<title>4.2 TREC ELC task</title>
<p>Since 2010, the TREC has included a task with a similar goal as part of the Entity track: Entity List Completion (ELC) (
<xref ref-type="bibr" rid="B2">Balog
<italic>et al</italic>
., 2011</xref>
). The task contains eight topics, each including a description of the task and a list of examples. This list should be expanded by finding entities from a given set which are in a sibling relationship to the examples. Finally, results have to be mapped to a given set of URIs. We skipped the last step since this was not in the scope of our work and should in general not decrease recall and precision. As a corpus, the English portion of the ClueWeb09 dataset, comprising ~500 million webpages, was used in the ELC task.</p>
<p>To test whether our system is capable of finding siblings from the topics, we performed a simple experiment. From the provided examples of each topic, we picked three, generated siblings from them using our approach, and checked whether the results correspond to the results in the provided sets (
<xref ref-type="table" rid="T3">Table 3</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S2</ext-link>
). All topics but one come from outside the biomedical domain. The evaluation shows that our method can find the majority of the correct siblings (R-precision: 55.3%) and is capable of finding almost all siblings (recall: 86.7%). This indicates that our system is not limited to the biomedical domain and can generate siblings in other domains as well. Compared with the other contestants of the ELC task, our R-precision was 24.1% better than the best result (recall was not measured), although our evaluation method differs in some respects from that of the ELC task.
<table-wrap id="T3" position="float">
<label>Table 3.</label>
<caption>
<p>Results from the 2010 TREC ELC task</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Domain</th>
<th rowspan="1" colspan="1">Siblings</th>
<th rowspan="1" colspan="1">Recall (%)</th>
<th rowspan="1" colspan="1">R-precision (%)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Professional sports teams</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">75.0</td>
<td rowspan="1" colspan="1">62.5</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Pharmaceutical products</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">100.0</td>
<td rowspan="1" colspan="1">100.0</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Airlines</td>
<td rowspan="1" colspan="1">45</td>
<td rowspan="1" colspan="1">84.4</td>
<td rowspan="1" colspan="1">24.4</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Companies</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">60.0</td>
<td rowspan="1" colspan="1">20.0</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Airlines II</td>
<td rowspan="1" colspan="1">27</td>
<td rowspan="1" colspan="1">96.2</td>
<td rowspan="1" colspan="1">40.7</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Universities</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">100.0</td>
<td rowspan="1" colspan="1">70.0</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Television Chefs</td>
<td rowspan="1" colspan="1">40</td>
<td rowspan="1" colspan="1">77.5</td>
<td rowspan="1" colspan="1">45.0</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Whisky distilleries</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">100.0</td>
<td rowspan="1" colspan="1">80.0</td>
</tr>
<tr>
<td colspan="4" rowspan="1">
<hr></hr>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Average</td>
<td rowspan="1" colspan="1">18.25</td>
<td rowspan="1" colspan="1">86.7</td>
<td rowspan="1" colspan="1">55.3</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Only a subset of the available topics from the entity track was suitable for this task (a full description of the tasks is given in
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S2</ext-link>
). The column ‘Siblings’ shows the number of siblings that can be found. Recall is the percentage of found siblings. R-precision is the precision at the
<italic>R</italic>
-th position where
<italic>R</italic>
is the number of expected siblings.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="SEC4.3">
<title>4.3 Runtime</title>
<p>Our approach works on-the-fly, meaning every time siblings are generated, the pipeline (
<xref ref-type="fig" rid="F3">Fig. 3</xref>
) is re-run with the given seed terms. Because siblings are generated on-the-fly, the results are always up-to-date, and seed terms do not have to be biomedical terminology but can come from any domain. Even though the retrieval and extraction process is highly parallelized, generating siblings can take up to 9 s. Most of this time is spent querying the web search engines (on average 2.38 s) and retrieving web pages (on average 5.92 s). However, by caching recent sibling generations, existing results are returned almost immediately.</p>
</sec>
<sec id="SEC4.4">
<title>4.4 Ontology generation plugin</title>
<p>We integrated this work into DOG4DAG (
<xref ref-type="bibr" rid="B22">Wächter and Schroeder, 2010</xref>
), our ontology generation plugin for Protégé and OBO-Edit (see
<xref ref-type="fig" rid="F7">Fig. 7</xref>
for a screenshot of the plugin in Protégé). Siblings can be generated by either selecting a term with at least two child terms or by manually typing the seed terms. For each sibling, the plugin automatically checks for cross-references to other biomedical ontologies using the EBI Ontology Lookup Service (
<xref ref-type="bibr" rid="B5">Côté
<italic>et al</italic>
., 2008</xref>
) or the BioPortal web service (
<xref ref-type="bibr" rid="B25">Whetzel
<italic>et al</italic>
., 2011</xref>
). This way, biocurators can identify other ontologies of interest and link their terms to them. Furthermore, generated terms are automatically mapped to terms in the currently loaded ontology to help biocurators link terms to other ontologies. Finally, already existing terms are printed in bold face, allowing the user to quickly spot them in the loaded ontology.
<fig id="F7" position="float">
<label>Fig. 7.</label>
<caption>
<p>Screenshot of sibling generation results in the DOG4DAG plugin for Protégé</p>
</caption>
<graphic xlink:href="bts215f7"></graphic>
</fig>
</p>
<p>For experimentation, we loaded the Human Disease Ontology (DO) [
<ext-link ext-link-type="uri" xlink:href="http://diseaseontology.sourceforge.net">http://diseaseontology.sourceforge.net</ext-link>
] (as of 15/12/2011) into OBO-Edit and selected three
<italic>genetic skin diseases</italic>
terms (
<italic>cutis laxa</italic>
,
<italic>Hailey–Hailey disease</italic>
and
<italic>Rothmund–Thomson syndrome</italic>
) as seed terms for sibling generation. The candidate siblings can be found in
<xref ref-type="table" rid="T4">Table 4</xref>
. Almost all of the generated terms are also genetic skin diseases. Most of them are already part of the ontology (the terms in bold face). However, many of them are simply categorized as
<italic>monogenic diseases</italic>
and could be added to
<italic>genetic skin diseases</italic>
right away. In addition, a number of candidate siblings, such as
<italic>Naegeli syndrome</italic>
, are not yet part of the ontology and can also be added to the parent term.
<table-wrap id="T4" position="float">
<label>Table 4.</label>
<caption>
<p>Generated siblings for
<italic>genetic skin diseases</italic>
(DOID:1698) using the child terms
<italic>cutis laxa</italic>
(DOID:3144),
<italic>Hailey–Hailey disease</italic>
(DOID:0050429), and
<italic>Rothmund–Thomson disease</italic>
(DOID:2732) as seed terms</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Rank</th>
<th rowspan="1" colspan="1">Generated sibling</th>
<th rowspan="1" colspan="1">Parent term</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">
<bold>Cockayne syndrome</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Monogenic diseases</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1">
<bold>Xeroderma pigmentosum</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Monogenic diseases</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1">
<bold>Hypohidrotic ectodermal dysplasia</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Monogenic diseases</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1">
<bold>Incontinentia pigmenti</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Genetic skin disease</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">
<bold>Dyskeratosis congenita</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Genetic skin disease</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1">
<bold>Erythrokeratodermia variabilis</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Genetic skin disease</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">
<bold>Clouston syndrome</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Monogenic disease</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">
<bold>Proteus syndrome</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Physical disorder</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1">
<bold>Erythropoietic Protoporphyria</bold>
</td>
<td rowspan="1" colspan="1">
<italic>Acute porphyria</italic>
</td>
</tr>
<tr>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">Naegeli syndrome</td>
<td rowspan="1" colspan="1">Not yet part of DO</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Most of the terms are already part of the loaded human disease ontology (printed in bold). However,
<italic>Naegeli syndrome</italic>
is a term that can be added to
<italic>genetic skin diseases</italic>
.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
</sec>
<sec id="SEC5">
<title>5 DISCUSSION</title>
<sec id="SEC5.1">
<title>5.1 Text-based versus structure-based approach</title>
<p>First, we will look at the results of the two approaches and discuss individual advantages and disadvantages.</p>
<p>When examining the quantity of sibling candidates alone, the structure-based approach yields more results, because in contrast to the text-based approach it requires only that seed terms occur on the same web page, but not necessarily close to each other. Even very distant terms can be semantically related, like headings separated by multiple paragraphs. As long as the headings are on the same path in the parse tree, they will be discovered. On the other hand, this also leads to false positives, since not all headings on the same path are necessarily semantically related. In contrast, the search queries in the text-based approach include the introductory phrase (except for the most generic search pattern) and thus potentially find fewer results.</p>
<p>The structure-based approach has a number of other advantages. First, it works on arbitrary HTML documents, meaning almost all of the web search results can be utilized. Additionally, being able to exploit the structure of a document also means that this approach works independently of the language.</p>
<p>The second, text-based approach is based on an entirely different idea. Here, siblings are generated from text by finding enumerations in sentences and extracting the individual terms and the head term (if available). Regular expressions can match these patterns in sentences with great variability. Because the regular expressions are generated automatically, we do not need to be concerned about errors and omissions in hand-written expressions. The expressions have been optimized for biomedical and chemical terms. For instance, they allow non-ASCII characters, punctuation inside terms (e.g. ‘1,3-Butadiene’), and multi-word terms.</p>
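<p>As a minimal sketch of the enumeration matching described above, the following Python snippet applies one hand-written expression for the pattern ‘HEAD such as A, B and C’. The pattern, the names ENUMERATION, SEPARATOR and extract_enumeration, and the example sentence are illustrative assumptions; they are not the automatically generated expressions of the system, which additionally relies on POS tagging to delimit the head term.</p>
<preformat>
```python
import re

# Illustrative pattern (an assumption, not a generated expression of the system):
# "HEAD such as TERM, TERM and TERM". The head is taken naively as the text
# preceding "such as"; a real system would restrict it to a noun phrase.
ENUMERATION = re.compile(r"(\w[\w\- ]*?) such as (.+?)(?:\.\s|\.$|$)")

# Split on a comma followed by whitespace (optionally followed by and/or),
# or on a bare " and " / " or ". A comma inside a term such as
# "1,3-Butadiene" is not followed by whitespace and is therefore kept.
SEPARATOR = re.compile(r",\s+(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_enumeration(sentence):
    """Return (head term, list of enumerated terms), or None if nothing matches."""
    match = ENUMERATION.search(sentence)
    if match is None:
        return None
    head = match.group(1).strip()
    terms = [t.strip() for t in SEPARATOR.split(match.group(2)) if t.strip()]
    return head, terms
```
</preformat>
<p>For the sentence ‘Industrial solvents such as benzene, 1,3-Butadiene and toluene.’, the function returns the head ‘Industrial solvents’ and the three terms ‘benzene’, ‘1,3-Butadiene’ and ‘toluene’. Because \w is Unicode-aware in Python 3, non-ASCII letters in terms are accepted as well.</p>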
<p>That the text-based approach finds fewer results lies in the very nature of pattern-based extraction, which usually yields low recall (
<xref ref-type="bibr" rid="B10">Hearst, 1992</xref>
) and also fits the observations of
<xref ref-type="bibr" rid="B8">Etzioni
<italic>et al</italic>
. (2005</xref>
): their structure-based ‘List Extractor’ component finds about five times more results than the text-based approach. Contrary to our initial assumption, the precision of the text-based approach is equal to or lower than that of the structure-based approach in the MeSH evaluation (
<xref ref-type="fig" rid="F5">Fig. 5</xref>
). This is mainly due to the format of the retrieved snippets, which often contain truncated phrases and parts of sentences, making POS tagging and the subsequent extraction of the enumerated terms harder.</p>
<p>The two approaches overlap significantly in the siblings they generate, which suggests that each of them produces correct results independently. Nonetheless, each approach generates siblings that the other does not discover.</p>
</sec>
<sec id="SEC5.2">
<title>5.2 Assessment of the completeness of ontologies</title>
<p>While there exist guidelines and tools that help to assess or even ensure the technical quality or consistency of a domain ontology (e.g.
<xref ref-type="bibr" rid="B26">Yao
<italic>et al</italic>
., 2011</xref>
), it is much harder to determine whether an ontology covers all aspects of its domain; hence, it is hard to judge
<italic>completeness</italic>
. With the help of our set expansion method for sibling discovery, we are able to provide some judgement by comparing the generated siblings with those already existing in the ontology.</p>
<p>
<italic>Overall completeness of MeSH</italic>
: Considering the evaluation for MeSH in
<xref ref-type="sec" rid="SEC4.1">Section 4.1</xref>
, the generated siblings for the 1000 random sibling sets can be divided into five categories which are listed in
<xref ref-type="table" rid="T5">Table 5</xref>
. Almost 50% of the terms (47.2%) are generated correctly as a sibling or descendant of a seed term; for these, the automatic method agrees with our gold standard, MeSH. Another 30.1% of the generated siblings occur in MeSH but not as siblings of the seed terms, and a further 13.4% are not present in MeSH but exist as term labels within the UMLS Metathesaurus [the UMLS Metathesaurus (
<xref ref-type="bibr" rid="B3">Bodenreider, 2004</xref>
) is a collection of controlled vocabularies in the biomedical domain (including MeSH) and currently contains over 1 000 000 terms in total]. In summary,
<xref ref-type="table" rid="T5">Table 5</xref>
shows that over 75% of the generated terms are part of MeSH, which indicates that MeSH is for the most part complete with regard to its term base. We also manually evaluated a sample of 100 generated sibling terms that were not part of MeSH but can be found in the UMLS Metathesaurus. Of these terms, 52% were found to be true siblings of the seed terms (
<xref ref-type="table" rid="T6">Table 6</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S3</ext-link>
) and are as such good candidates to be added to MeSH in the future.
<table-wrap id="T5" position="float">
<label>Table 5.</label>
<caption>
<p>Distribution of the generated terms in MeSH and the UMLS Metathesaurus</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Category</th>
<th rowspan="1" colspan="1">Percentage (%)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Sibling of seed term in MeSH</td>
<td rowspan="1" colspan="1">40.7</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Descendant of seed term in MeSH</td>
<td rowspan="1" colspan="1">6.5</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Occurs elsewhere in MeSH</td>
<td rowspan="1" colspan="1">30.1</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Occurs in UMLS, but not in MeSH</td>
<td rowspan="1" colspan="1">13.4</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Not found in UMLS</td>
<td rowspan="1" colspan="1">9.3</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>In all, 47.2% of the terms are highly relevant and 90.7% are correct biomedical terminology. (Please note that the top 10 results are taken into account, no matter how many siblings the seed terms have in MeSH.)</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T6" position="float">
<label>Table 6.</label>
<caption>
<p>Manual evaluation of the terms from the categories ‘Occurs elsewhere in MeSH’ and ‘Occurs in UMLS, but not in MeSH’</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Category</th>
<th rowspan="1" colspan="1">True siblings (%)</th>
<th rowspan="1" colspan="1">Related siblings (%)</th>
<th rowspan="1" colspan="1">False siblings (%)</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">Occurs elsewhere in MeSH</td>
<td rowspan="1" colspan="1">16</td>
<td rowspan="1" colspan="1">47</td>
<td rowspan="1" colspan="1">37</td>
</tr>
<tr>
<td rowspan="1" colspan="1">Occurs in UMLS, but not in MeSH</td>
<td rowspan="1" colspan="1">52</td>
<td rowspan="1" colspan="1">16</td>
<td rowspan="1" colspan="1">32</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>True siblings</italic>
are generated terms that can be added to existing seed terms.
<italic>Related siblings</italic>
are terms with a similar subject, but are not true siblings of the seed terms.
<italic>False siblings</italic>
are terms which are not related to the seed terms.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
<p>
<italic>Consistency of MeSH</italic>
: Finally, we also evaluated a sample of 100 generated terms that were not siblings of the seed terms in MeSH but occurred at a different position within MeSH (see ‘Occurs elsewhere in MeSH’ in
<xref ref-type="table" rid="T6">Table 6</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S4</ext-link>
). Only 16% were found to be true siblings (
<xref ref-type="table" rid="T6">Table 6</xref>
). This indicates that MeSH is for the most part modelled correctly with regard to the location of its siblings.</p>
<p>Both results demonstrate that sibling generation is a powerful tool to assess the completeness of ontologies.</p>
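<p>The aggregate figures discussed in this section can be re-derived from the rows of Table 5. The short Python sketch below sums the relevant categories; the names TABLE_5 and share are hypothetical helpers introduced only for this illustration, and the percentages are copied from the table.</p>
<preformat>
```python
# Percentages copied from the rows of Table 5.
TABLE_5 = {
    "Sibling of seed term in MeSH": 40.7,
    "Descendant of seed term in MeSH": 6.5,
    "Occurs elsewhere in MeSH": 30.1,
    "Occurs in UMLS, but not in MeSH": 13.4,
    "Not found in UMLS": 9.3,
}


def share(categories):
    """Sum the percentages of the given Table 5 categories, rounded to 0.1."""
    return round(sum(TABLE_5[name] for name in categories), 1)


# Terms for which the method agrees with MeSH (sibling or descendant).
agrees_with_mesh = share([
    "Sibling of seed term in MeSH",
    "Descendant of seed term in MeSH",
])

# Terms that occur anywhere in MeSH.
part_of_mesh = share([
    "Sibling of seed term in MeSH",
    "Descendant of seed term in MeSH",
    "Occurs elsewhere in MeSH",
])

# Terms that occur in MeSH or elsewhere in the UMLS Metathesaurus.
found_in_umls = share([
    "Sibling of seed term in MeSH",
    "Descendant of seed term in MeSH",
    "Occurs elsewhere in MeSH",
    "Occurs in UMLS, but not in MeSH",
])
```
</preformat>
<p>Under these definitions, agrees_with_mesh is 47.2, part_of_mesh is 77.3 (the ‘over 75%’ quoted above) and found_in_umls is 90.7, which is the complement of the 9.3% of terms not found in UMLS.</p>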
</sec>
<sec id="SEC5.3">
<title>5.3 Adherence to ontology design criteria and naming conventions</title>
<p>Additionally, we examined the generated siblings regarding criteria for ontology design and naming conventions. In
<xref ref-type="bibr" rid="B18">Schober
<italic>et al</italic>
. (2009</xref>
), conventions for OBO Foundry ontologies were presented. The conventions support unified ontology development and help developers avoid mistakes when working on ontologies. Overall, the generated siblings adhere to the proposed design guidelines and naming conventions. For instance, new terms should incorporate the genus-differentia style for names. Since we prefer candidate siblings containing this style, these are ranked higher in the results. Another convention is that acronyms should be expanded. Because we included an abbreviation tagger in the processing pipeline (
<xref ref-type="fig" rid="F3">Fig. 3</xref>
), acronyms are expanded automatically where possible. Finally, since the text-based approach allows only noun phrases (NPs) as candidate siblings, we avoid the use of conjunctions.</p>
</sec>
<sec id="SEC5.4">
<title>5.4 Limitations and future work</title>
<p>Generally, our work is based on the assumption that terms can in principle be found in text and that the web is representative of the domain to be modelled by the ontology. Although we designed our approach to be as generic as possible, some limitations are nonetheless inherent.</p>
<p>First, we can only generate siblings for natural language terms that are semantically related and discussed in the literature or on websites in general. Furthermore, we take neither the specific relationship type nor synonymous terms into account.</p>
<p>Overall, the system works best for completing a set of terms with the same semantic type. However, it cannot explicitly recognize the context of the given seed terms. We will consider incorporating contextual information in the query to increase the precision of the generated siblings. Additionally, we will work on improving the recall of the text-based approach by two means. First, if the retrieved snippet is not a full sentence, we will fetch the whole web page and extend the existing snippet to complete the sentence. Second, we will extend the number of patterns for text-based sibling discovery using a bootstrapped pattern learning approach, similar to the ones presented in
<xref ref-type="bibr" rid="B8">Etzioni
<italic>et al</italic>
. (2005</xref>
) and
<xref ref-type="bibr" rid="B12">Kozareva
<italic>et al</italic>
. (2008</xref>
). We also plan to further improve the scalability of sibling generation when more than four seed terms are used. Finally, we will investigate whether using a higher number of seed terms can effectively improve the precision of the retrieved candidate siblings.</p>
</sec>
</sec>
<sec id="SEC6">
<title>6 CONCLUSION</title>
<p>In this work, we presented an approach to extend ontologies systematically by finding new terms similar to two or three provided terms. We combined two very different methods and used a simple rank aggregation strategy to merge their results. To account for the peculiarities of biomedical terminology, we used hypernym matching and compound term matching to improve the ranking of terms that fulfill these criteria.</p>
<p>The evaluation using MeSH shows that our approach can successfully support ontology engineers by semi-automatically completing existing sets of siblings. Additionally, our approach can serve as a first step towards evaluating the completeness of ontologies. We showed that MeSH covers the biomedical domain well, as the vast majority of siblings suggested by the method are already contained in it. Nonetheless, a significant number of good candidates for incorporation could be suggested with high precision. Furthermore, our evaluation suggests that text mining in the biomedical domain gains significantly from full-text resources such as PubMed Central.</p>
<p>In particular, when evaluating set expansion for sibling discovery using 1000 randomly selected term sets from MeSH, our approach finds 79.3% of the existing siblings in the sets using three seed terms. Both methods contribute to the results. However, the structure-based approach finds slightly more true positives than the text-based approach. When only two seed terms are used, the method still produces satisfactory results, but recall and precision drop to 68.2% and 51.5%, respectively. The generated terms fulfill ontology naming conventions and need no post-editing.</p>
<p>Our method is universal, meaning the system allows sibling generation for any domain, as shown by the evaluation using the TREC ELC task, where 86.7% of the siblings were discovered with a precision of 55.3%. Additionally, the method can in principle generate siblings for any language, since the structure-based approach works independently of the language.</p>
<p>Since this work is integrated into the DOG4DAG plugin, ontologies of all common formats can be extended seamlessly in Protégé and OBO-Edit and generated terms are cross-referenced to other biomedical ontologies.</p>
<p>
<italic>Funding</italic>
:
<funding-source>PONTE and GeneCloud</funding-source>
.</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data">
<caption>
<title>Supplementary Data</title>
</caption>
<media mimetype="text" mime-subtype="html" xlink:href="supp_28_12_i292__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_1.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_2.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_3.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_4.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ashburner</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Gene Ontology: tool for the unification of biology</article-title>
<source>Nat. Genet.</source>
<year>2000</year>
<volume>25</volume>
<fpage>25</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="pmid">10802651</pub-id>
</element-citation>
</ref>
<ref id="B2">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Balog</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Overview of the TREC 2010 entity track</article-title>
<source>Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010)</source>
<year>2011</year>
<publisher-loc>Gaithersburg, Maryland, USA</publisher-loc>
</element-citation>
</ref>
<ref id="B3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bodenreider</surname>
<given-names>O.</given-names>
</name>
</person-group>
<article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>D267</fpage>
<lpage>D270</lpage>
<pub-id pub-id-type="pmid">14681409</pub-id>
</element-citation>
</ref>
<ref id="B4">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Brunzel</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Spiliopoulou</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Discovering Multi Terms and Co-hyponymy from XHTML Documents with XTREEM</article-title>
<source>Knowledge Discovery from XML Documents.</source>
<year>2006</year>
<volume>3915</volume>
<publisher-loc>Berlin/Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
<fpage>22</fpage>
<lpage>32</lpage>
<comment>Lecture Notes in Computer Science doi: 10.1007/11730262_5</comment>
</element-citation>
</ref>
<ref id="B5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Côté</surname>
<given-names>R.G.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The Ontology Lookup Service: more data and better tools for controlled vocabulary queries</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>W372</fpage>
<lpage>W376</lpage>
<pub-id pub-id-type="pmid">18467421</pub-id>
</element-citation>
</ref>
<ref id="B6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Day-Richter</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>OBO-Edit–an ontology editor for biologists</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>2198</fpage>
<lpage>2200</lpage>
<pub-id pub-id-type="pmid">17545183</pub-id>
</element-citation>
</ref>
<ref id="B7">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Doms</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Schroeder</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>GoPubMed: exploring PubMed with the Gene Ontology</article-title>
<source>Nucleic Acids Res.</source>
<year>2005</year>
<volume>33</volume>
<fpage>W783</fpage>
<lpage>W786</lpage>
<pub-id pub-id-type="pmid">15980585</pub-id>
</element-citation>
</ref>
<ref id="B8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Etzioni</surname>
<given-names>O.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Unsupervised named-entity extraction from the Web: an experimental study</article-title>
<source>Artif. Intell.</source>
<year>2005</year>
<volume>165</volume>
<fpage>91</fpage>
<lpage>134</lpage>
</element-citation>
</ref>
<ref id="B9">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Frantzi</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Automatic recognition of multi-word terms: the C-value/NC-value Method</article-title>
<source>Int. J. Digit. Libr.</source>
<year>2000</year>
<volume>3</volume>
<fpage>115</fpage>
<lpage>130</lpage>
</element-citation>
</ref>
<ref id="B10">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Hearst</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Automatic acquisition of hyponyms from large text corpora</article-title>
<source>Proceedings of the 14th Conference on Computational Linguistics.</source>
<year>1992</year>
<volume>2</volume>
<publisher-loc>Nantes, France</publisher-loc>
<fpage>539</fpage>
<lpage>545</lpage>
</element-citation>
</ref>
<ref id="B11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Howe</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Big data: the future of biocuration</article-title>
<source>Nature</source>
<year>2008</year>
<volume>455</volume>
<fpage>47</fpage>
<lpage>50</lpage>
<pub-id pub-id-type="pmid">18769432</pub-id>
</element-citation>
</ref>
<ref id="B12">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Kozareva</surname>
<given-names>Z.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Semantic class learning from the web with hyponym pattern linkage graphs</article-title>
<source>Proceedings of ACL-08: HLT</source>
<year>2008</year>
<publisher-loc>Columbus, OH, USA</publisher-loc>
<fpage>1048</fpage>
<lpage>1056</lpage>
</element-citation>
</ref>
<ref id="B13">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Induction of semantic classes from natural language text</article-title>
<source>Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
<year>2001</year>
<publisher-loc>San Francisco, CA, USA</publisher-loc>
<fpage>317</fpage>
<lpage>322</lpage>
</element-citation>
</ref>
<ref id="B14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Natural language processing methods and systems for biomedical ontology learning</article-title>
<source>J. Biomed. Inform.</source>
<year>2011</year>
<volume>44</volume>
<fpage>163</fpage>
<lpage>179</lpage>
<pub-id pub-id-type="pmid">20647054</pub-id>
</element-citation>
</ref>
<ref id="B15">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ogren</surname>
<given-names>P.V.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The compositional structure of gene ontology terms</article-title>
<source>Pacific Symposium on Biocomputing</source>
<year>2004</year>
<publisher-loc>Hawaii, USA</publisher-loc>
<fpage>214</fpage>
<lpage>225</lpage>
</element-citation>
</ref>
<ref id="B16">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Pantel</surname>
<given-names>P.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Web-scale distributional similarity and entity set expansion</article-title>
<source>Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</source>
<year>2009</year>
<publisher-loc>Singapore</publisher-loc>
<fpage>938</fpage>
<lpage>947</lpage>
</element-citation>
</ref>
<ref id="B17">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Paşca</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Acquisition of categorized named entities for web search</article-title>
<source>Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management</source>
<year>2004</year>
<publisher-loc>Washington, DC, USA</publisher-loc>
<fpage>137</fpage>
<lpage>145</lpage>
</element-citation>
</ref>
<ref id="B18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schober</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Survey-based naming conventions for use in OBO Foundry ontology development</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>125</fpage>
<pub-id pub-id-type="pmid">19397794</pub-id>
</element-citation>
</ref>
<ref id="B19">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Shi</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Pattern-based semantic class discovery with multi-membership support</article-title>
<source>Proceeding of the 17th ACM Conference on Information and Knowledge Management</source>
<year>2008</year>
<publisher-loc>Napa Valley, California, USA</publisher-loc>
<fpage>1453</fpage>
<lpage>1454</lpage>
</element-citation>
</ref>
<ref id="B20">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Shi</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Corpus-based semantic class mining: distributional vs. pattern-based approaches</article-title>
<source>Proceedings of the 23rd International Conference on Computational Linguistics</source>
<year>2010</year>
<publisher-loc>Beijing, China</publisher-loc>
<fpage>993</fpage>
<lpage>1001</lpage>
</element-citation>
</ref>
<ref id="B21">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shinzato</surname>
<given-names>K.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Acquiring hyponymy relations from web documents</article-title>
<source>Proc. HLT-NAACL</source>
<year>2004</year>
<volume>2004</volume>
<fpage>73</fpage>
<lpage>80</lpage>
</element-citation>
</ref>
<ref id="B22">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wächter</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Schroeder</surname>
<given-names>M.</given-names>
</name>
</person-group>
<article-title>Semi-automated ontology generation within OBO-Edit</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>i88</fpage>
<lpage>i96</lpage>
<pub-id pub-id-type="pmid">20529942</pub-id>
</element-citation>
</ref>
<ref id="B23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>W.</given-names>
</name>
</person-group>
<article-title>Language-independent set expansion of named entities using the web</article-title>
<source>2007 Seventh IEEE International Conference on Data Mining</source>
<year>2007</year>
<fpage>342</fpage>
<lpage>350</lpage>
</element-citation>
</ref>
<ref id="B24">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Whetzel</surname>
<given-names>P.L.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The MGED Ontology: a resource for semantics-based description of microarray experiments</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>866</fpage>
<lpage>873</lpage>
<pub-id pub-id-type="pmid">16428806</pub-id>
</element-citation>
</ref>
<ref id="B25">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Whetzel</surname>
<given-names>P.L.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications</article-title>
<source>Nucleic Acids Res.</source>
<year>2011</year>
<volume>39</volume>
<fpage>W541</fpage>
<lpage>W545</lpage>
<pub-id pub-id-type="pmid">21672956</pub-id>
</element-citation>
</ref>
<ref id="B26">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yao</surname>
<given-names>L.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Benchmarking ontologies: bigger or better?</article-title>
<source>PLoS Comput. Biol.</source>
<year>2011</year>
<volume>7</volume>
<fpage>e1001055</fpage>
<pub-id pub-id-type="pmid">21249231</pub-id>
</element-citation>
</ref>
<ref id="B27">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>H.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Employing topic models for pattern-based semantic class discovery</article-title>
<source>Proceedings of ACL/AFNLP 2009</source>
<year>2009</year>
<fpage>459</fpage>
<lpage>467</lpage>
</element-citation>
</ref>
</ref-list>
</back>
</pmc></record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Linguistique/explor/TamazightV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0001800 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0001800 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Linguistique
   |area=    TamazightV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Wed Nov 15 18:28:35 2017. Site generation: Sat Feb 10 16:46:27 2024