</monospace>) if the seed term was found inside a table data cell). Extract the terms from nodes which have the same parent node inside the parse tree and the identical HTML tag as the seed term node.</p><p><italic>Group candidates</italic> : Group all extracted candidate siblings into a candidate sibling set. If the seed term is preceded or followed by a string (such as ‘function of <italic>seed term</italic> ’) in the text, all candidate siblings are also required to include this string.</p> </sec><sec id="SEC3.2"><title>3.2 Text-based approach</title>
<p>The second approach uses textual patterns to extract candidate siblings from enumerations in text. By querying search engines, we retrieve text snippets on-the-fly, from which siblings are extracted, filtered and ranked.</p>
<p><italic>Pattern extraction and expansion</italic> : We built a small, manually annotated corpus containing sentences with typical enumerations. Whenever a sentence is added to the corpus, the annotated sentences are preprocessed automatically. From each sentence, head terms, enumeration items and words in between are extracted. To form the basis for patterns, head terms and enumerations are replaced with placeholders and the surrounding text is removed. These sentences are expanded and altered to allow for more variation. For instance, commas are added after introductory phrases and conjunctions are changed (e.g. ‘and’ is replaced with ‘or’). From these patterns, regular expressions are created automatically, which are used to match sentences and extract enumeration items and the head term. To add a new type of enumeration, one can simply add the new sentence to the corpus, which in turn leads to new generalized patterns. The generated regular expressions are stored on disk and are loaded for sibling generation.</p> <p><italic>Web search</italic> : Like the structure-based approach, web search is used to retrieve snippets. In the queries, the introductory phrase is included to find relevant results. Additionally, the <monospace>NEAR</monospace> operator of the search engine is used to force the seed terms to appear close to each other, which in most cases means in the same sentence. We do not retrieve the whole website, but instead use snippets (usually 300 characters long) provided by the search engines containing the search terms, and thus the enumeration. Again, pairwise combinations are used for the web queries.</p> <p><italic>Text processing and enumeration extraction</italic> : The retrieved snippets are first tokenized and then processed using sentence, POS, and NP tagging. 
For POS tagging, the LingPipe Tagger [<ext-link ext-link-type="uri" xlink:href="http://alias-i.com/lingpipe/">http://alias-i.com/lingpipe/</ext-link> ] trained on the MEDLINE corpus is used. Phrases of the pattern <monospace>[adj|verb]*[fill]{2}[noun]+</monospace> are regarded as NPs (<monospace>fill</monospace> are words like ‘of’, ‘the’, ‘for’, etc.). Furthermore, abbreviations are extracted by checking if a candidate term contains a short form after the long form in brackets. If both forms match, the short form (i.e. the abbreviation) is grouped with the long form. The sentences are then matched against the regular expressions of the pipeline (‘morphosyntactic matching’). If a sentence contains multiple enumerations, all enumerations are extracted separately. To find as many enumerations as possible, three sets of regular expressions are used for matching.</p> <p>The first set consists of regular expressions including the head term, the introductory phrase, the enumeration items and a conjunction to separate the last two items (this conjunction does not exist if the phrase is located at the end). The search results are matched against the regular expressions. If a match occurs, the enumerated items are extracted and subsequently analyzed. If a seed term occurs among the extracted items, the remaining items become a candidate sibling set. If a snippet does not match a regular expression of this set, the next set is used. It matches all sentences which include the head term, introductory phrase and enumeration items. The last set matches all enumerations which include a conjunction at the end, but do not have an introductory phrase. Note that each set is more generic than the previous one.</p>
<p>Finally, all extracted terms are validated by checking whether they are NPs (‘linguistic revision’). This is especially important for the last phrase (after the conjunction), where it is not possible to determine the end of the phrase reliably without the NP tagger.</p>
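<p>The matching step for the first set of regular expressions can be illustrated with a minimal sketch. The pattern below is a hand-written stand-in, not one of the automatically generated expressions; the introductory phrases ‘such as’ and ‘including’, the function names and the example sentence are assumptions chosen for illustration. Without an NP tagger, the sketch simply ends the last item at the next punctuation mark, which is exactly the weakness that the linguistic revision addresses.</p>

```python
import re

# Hypothetical pattern in the spirit of the first regular-expression set:
# <head term> <introductory phrase> <item>, <item> <conjunction> <item>
ENUMERATION = re.compile(
    r"(?P<head>[\w-]+)\s+(?:such as|including)\s+"
    r"(?P<items>[^.;]+?)\s*(?:,\s*)?\b(?:and|or)\s+(?P<last>[^.;,]+)"
)


def extract_items(sentence):
    """Return (head term, enumerated items) or (None, []) if nothing matches."""
    match = ENUMERATION.search(sentence)
    if match is None:
        return None, []
    items = [t.strip() for t in match.group("items").split(",") if t.strip()]
    # Assumption: the last item ends at the next punctuation mark.
    items.append(match.group("last").strip())
    return match.group("head"), items


def candidate_siblings(sentence, seeds):
    """Keep the remaining items only if a seed term is among the extracted items."""
    head, items = extract_items(sentence)
    seed_set = {s.lower() for s in seeds}
    if not any(item.lower() in seed_set for item in items):
        return head, []
    return head, [item for item in items if item.lower() not in seed_set]


if __name__ == "__main__":
    text = "The membrane contains lipids such as waxes, sterols and phospholipids."
    print(candidate_siblings(text, ["Waxes"]))
```

<p>The head term returned by the sketch corresponds to the term that is later used for hypernym matching.</p>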
<p>In addition, enumerations in sentences without any introductory phrases, conjunctions or head terms are also retrieved. For this, the search engines are queried using only the seed terms concatenated by the <monospace>NEAR</monospace> operator. The separators between the phrases are automatically recognized and all enumerated items are subsequently extracted.</p> </sec> <sec id="SEC3.3"><title>3.3 Syntactic filtering</title>
<p>To improve the accuracy of the extracted candidate siblings, a number of syntactic filtering steps are applied. We set a minimum length of 3 and a maximum length of 50 characters for each generated candidate sibling. By using a minimum length of 3 characters, gene and protein family names like ‘p53’ or ‘WNT’ are still regarded as valid items. By limiting the length to 50 characters, we can exclude overly long, spurious NPs. In addition, we use a stop-word list to remove unnecessary words like ‘other’, ‘many more’ or ‘etc’ from the extracted siblings. Finally, all candidate sibling sets containing fewer than three terms (including the seed terms) are dropped.</p>
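<p>These filtering rules can be sketched as follows. The thresholds (3 and 50 characters, at least three terms per set) are taken from the text; the function name, the small stop-word list and the decision to drop an item that consists only of a stop word are assumptions made for illustration.</p>

```python
# Illustrative subset only; the actual stop-word list is not given in the text.
STOP_WORDS = {"other", "others", "many more", "etc", "etc."}


def filter_candidate_set(candidates, seeds, min_len=3, max_len=50, min_terms=3):
    """Apply the length and stop-word filters to one candidate sibling set.

    `candidates` are the extracted terms without the seed terms, `seeds` are
    the seed terms found in the same set. Returns the surviving candidates,
    or an empty list if the whole set is dropped.
    """
    kept = []
    for term in candidates:
        term = term.strip()
        if term.lower() in STOP_WORDS:
            continue  # assumption: an item that is only a stop word is removed
        if not (min_len <= len(term) <= max_len):
            continue  # too short or too long to be a plausible term
        kept.append(term)

    # The minimum set size counts the seed terms as well.
    distinct = {t.lower() for t in kept} | {s.lower() for s in seeds}
    return kept if len(distinct) >= min_terms else []


if __name__ == "__main__":
    print(filter_candidate_set(["p53", "WNT", "etc", "ab"], ["TP63"]))
```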
<p>Duplicated candidate sibling sets (with the same siblings) are automatically discarded, since they are most likely retrieved from identical web pages.</p> </sec> <sec id="SEC3.4"><title>3.4 Ranking</title>
<p>After retrieving all relevant web pages and extracting the candidate siblings, the siblings from the structure-based and text-based approaches need to be ranked and then aggregated into a single ranked list. For ranking the individual candidate sibling sets, we use a straightforward co-occurrence scheme: candidate siblings are ranked higher if they co-occur with more seed terms. Each candidate sibling <italic>s</italic> in a candidate sibling set containing <italic>k</italic> seed terms (<italic>k</italic> > 0) is given a score as follows:
<disp-formula><graphic xlink:href="bts215um1"></graphic> </disp-formula> </p> <p>Thus, our system rewards sibling sets which contain a larger number of seed terms and gives a low score (0.1) to sibling sets with only one seed term, since this co-occurrence may be coincidental. If a candidate sibling occurs in multiple candidate sibling sets, the scores are added up to yield a score for each candidate sibling. If a lexical variant such as a synonym or abbreviation of a seed term occurs in the candidate sibling set, it is also counted as a seed term.</p>
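<p>The accumulation of scores across candidate sibling sets can be sketched as shown below. Only the score of 0.1 for sets with a single seed term is taken from the text; the value used for sets with more seed terms (<italic>k</italic> - 1) is a placeholder assumption and is not the scoring formula of this work. The function names are hypothetical, and lexical variants of seed terms are not resolved in this sketch.</p>

```python
from collections import defaultdict


def set_score(k):
    """Score contributed by a candidate sibling set containing k seed terms.

    Only the value 0.1 for k == 1 is stated in the text. The value k - 1 for
    larger k is a placeholder that merely grows with k; it is an assumption.
    """
    if k < 1:
        raise ValueError("a scored set must contain at least one seed term")
    return 0.1 if k == 1 else float(k - 1)


def rank_candidates(candidate_sets, seeds):
    """Sum the set scores per candidate and sort by descending total score."""
    seed_set = {s.lower() for s in seeds}
    totals = defaultdict(float)
    for terms in candidate_sets:
        lowered = {t.lower() for t in terms}
        k = len(lowered & seed_set)
        if k == 0:
            continue  # sets without any seed term are not scored
        for term in lowered - seed_set:
            totals[term] += set_score(k)
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sets = [["a", "b", "x"], ["a", "y"], ["a", "x"], ["z", "w"]]
    print(rank_candidates(sets, ["a", "b"]))
```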
<p>In addition, the retrieved candidate siblings are re-ranked by the following measures:
<list list-type="bullet"><list-item><p><italic>Hypernym matching</italic> : For the text-based approach, we identify the head term of the enumeration. If this term matches the hypernym of the seed terms (i.e. their parent term), it is preferred. Since the head term in the text almost always occurs in plural, we stem the extracted head terms first.</p> </list-item> <list-item><p><italic>Compound term matching</italic> [as proposed by <xref ref-type="bibr" rid="B15">Ogren <italic>et al</italic> . (2004</xref> )]: Biomedical terminology often consists of multiple compound terms, e.g. subterms of the MeSH term <italic>Stem Cells</italic> include <italic>Adult Stem Cells</italic> ,
<italic>Hematopoietic Stem Cells</italic> and <italic>Mesenchymal Stem Cells</italic> . We prefer the candidate siblings whose parent term is a suffix of this sibling.</p> </list-item> </list> </p> <p><italic>Combining of ranked lists of methods</italic> : We examined several methods for rank aggregation of both methods. In our evaluation, summing up the normalized scores of the candidate siblings with identical labels and merging both lists by sorting them by their normalized scores yielded the best results. Previously, other ranking methods have been evaluated (<xref ref-type="bibr" rid="B23">Wang and Cohen, 2007</xref> , <xref ref-type="bibr" rid="B4">Brunzel and Spiliopoulou, 2006</xref> ) and shown to have no significant impact on the overall results.</p> </sec> <sec id="SEC3.5"><title>3.5 Evaluation</title>
<p>To evaluate our method, we used the 2011 MeSH [<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/mesh">http://www.nlm.nih.gov/mesh</ext-link> ]. For this purpose, we randomly took a sample of 1000 terms in MeSH and chose three random child terms as seed terms for each set. We required each selected parent term to have at least five child terms, so that the system can potentially find at least two siblings when three of the child terms are used as seed terms. Additionally, all child terms which consist of more than two words were not used as seed terms since they rarely occur in free text. Terms with artificial descriptor names (e.g. ‘<italic>Surgical Procedures, Operative</italic> ’) were cleaned up. The 16 top-level categories were also ignored, since the terms are not semantically related, but are rather categorical. We batch-processed all 1000 term sets automatically using the implemented system.</p> <p>We selected MeSH because it is a thesaurus with a broad coverage (it comprises 26 142 terms and 203 554 lexical variants). Although most of its terms are from the biomedical domain, it also contains terms from other domains, for example in the top-level categories <italic>Geographicals</italic> or <italic>Humanities</italic> .</p> </sec> </sec><sec id="SEC4"><title>4 RESULTS</title>
<sec id="SEC4.1"><title>4.1 MeSH evaluation</title>
<p><italic>Recall</italic> : First, we evaluated how many of the remaining siblings were found. Of the 7922 siblings which were contained in the sets (not counting the seed terms), 6284 (79.3%) were discovered when using both approaches. For 601 of the 1000 selected sets, all siblings were discovered (<xref ref-type="fig" rid="F4">Fig. 4</xref> ). Hence, our approach can find most of the existing siblings of the selected sibling sets. The results from both approaches have an overlap of 72.5%. Of the correct results, 35.0% were contributed exclusively by the structure-based approach, and 22.6% were contributed exclusively by the text-based approach. This shows that our idea of combining both approaches is reasonable and improves the overall results.
<fig id="F4" position="float"><label>Fig. 4.</label>
<caption><p>Distribution of the recall of discovered siblings in all results using both approaches. For 601 of the 1000 sets, all siblings were found; the average recall over all sets is 79.3%</p> </caption> <graphic xlink:href="bts215f4"></graphic> </fig> </p> <p><italic>Precision</italic> : Furthermore, we investigated the precision of the generated siblings to determine the fraction of siblings that are relevant with regard to the existing siblings. Precision is defined as
<disp-formula><graphic xlink:href="bts215um2"></graphic> </disp-formula> </p> <p>We used a cut-off rank of 10 when evaluating precision (if the sibling set contains fewer than 10 siblings, we used the number of siblings in the set). Over the 1000 selected sets, the average precision using the structure-based approach is 53.0%, whereas the average precision for the text-based approach is 48.0%. When combining the two approaches using rank aggregation, the precision is 60.8%. The results show that rank aggregation improves precision when compared with the single approaches (<xref ref-type="fig" rid="F5">Fig. 5</xref> ).
<fig id="F5" position="float"><label>Fig. 5.</label>
<caption><p>Distribution of the precision of the siblings found from the 1000 randomly selected MeSH terms in the Top 10. Half of the tested sets were automatically extended with a precision above 75%. The structure-based approach has shown a higher precision than the text-based approach</p> </caption> <graphic xlink:href="bts215f5"></graphic> </fig> </p> <p>Since recall and precision do not take the ranking of the generated siblings into account, we also examined the percentage of correct siblings for each position within the top 10 (<xref ref-type="fig" rid="F6">Fig. 6</xref> ). Overall, the structure-based approach yields better results than the text-based approach. Combining the results improves the performance compared with the single approaches. At the top rank, 76.6% of the generated siblings are correct using rank aggregation, whereas 64.7% and 66.4% are correct for the structure-based and text-based approach, respectively. Thus, generated siblings which are true siblings in MeSH are ranked higher than siblings not belonging to the sibling sets of our MeSH evaluation. If the descendants of the siblings are also taken into account, the percentage of correctly generated siblings increases to 81.3% for the top ranked result (70.8% and 69.5% for the structure-based and text-based approach, respectively).
<fig id="F6" position="float"><label>Fig. 6.</label>
<caption><p>Percentage of correctly generated siblings in the top 10. In general, ontologies can be extended using the top generated terms with high confidence. For the top 1 terms, 76.6% are generated correctly</p> </caption> <graphic xlink:href="bts215f6"></graphic> </fig> </p> <p><xref ref-type="table" rid="T2">Table 2</xref> shows three examples of generated siblings with varying recall and precision. <italic>Lipids</italic> has an overall recall of 100% and a precision of 90%. <italic>Europe</italic> contains 22 siblings, of which 15 were found (68% recall). However, it only has a precision of 70% within the top 10. When looking closer at the results, all of the top 10 terms are correct, although some are actually child terms of a sibling. An example of a term where only 4 of 24 siblings were found (16.7% recall) is <italic>Environment</italic> , which contains many generic terms (e.g. <italic>Confined Spaces</italic> or <italic>Ecosystem</italic> ).
<table-wrap id="T2" position="float"><label>Table 2.</label>
<caption><p>Top 10 results of selected examples</p> </caption> <table frame="hsides" rules="groups"><thead align="left"><tr><th rowspan="1" colspan="1">Parent term</th>
<th colspan="2" align="center" rowspan="1">Lipids</th>
<th colspan="2" align="center" rowspan="1">Europe</th>
<th colspan="2" align="center" rowspan="1">Environment</th> </tr> <tr><th rowspan="1" colspan="1">Seed terms</th>
<th colspan="2" rowspan="1">Sphingolipids, Lipoproteins, Lipopeptides</th>
<th colspan="2" rowspan="1">Netherlands, Finland, Austria</th>
<th colspan="2" rowspan="1">Fires, Greenhouse Effect, Water Movements</th> </tr> <tr><th rowspan="1" colspan="1">Rank</th>
<th rowspan="1" colspan="1">Generated term</th>
<th rowspan="1" colspan="1">Relation</th>
<th rowspan="1" colspan="1">Generated term</th>
<th rowspan="1" colspan="1">Relation</th>
<th rowspan="1" colspan="1">Generated term</th>
<th rowspan="1" colspan="1">Relation</th> </tr> </thead> <tbody align="left"><tr><td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1"><bold>Waxes</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>Belgium</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Pollution</td>
<td rowspan="1" colspan="1">Not in MeSH</td> </tr> <tr><td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1"><bold>Sterols</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Denmark</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Environmental Monitoring</td>
<td rowspan="1" colspan="1">Unrelated</td> </tr> <tr><td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1"><bold>Lipopolysaccharides</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>France</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>Water Vapour</bold> </td> <td rowspan="1" colspan="1">Sibling</td> </tr> <tr><td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1"><bold>Phospholipids</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Sweden</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Mars</td>
<td rowspan="1" colspan="1">Unrelated</td> </tr> <tr><td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1"><bold>Glycolipids</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Norway</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Deposition</td>
<td rowspan="1" colspan="1">Not in MeSH</td> </tr> <tr><td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1"><bold>Fatty acids</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>Italy</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Air</td>
<td rowspan="1" colspan="1">Child</td> </tr> <tr><td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1"><bold>Oils</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>Switzerland</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Oxygen</td>
<td rowspan="1" colspan="1">Unrelated</td> </tr> <tr><td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1"><bold>Peptidoglycans</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>Spain</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">GDP</td>
<td rowspan="1" colspan="1">Not in MeSH</td> </tr> <tr><td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1"><bold>Lipofuscin</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1"><bold>Ireland</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Ozone</td>
<td rowspan="1" colspan="1">Unrelated</td> </tr> <tr><td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1"><bold>Membrane Lipids</bold> </td> <td rowspan="1" colspan="1">Sibling</td>
<td rowspan="1" colspan="1">Hungary</td>
<td rowspan="1" colspan="1">Child</td>
<td rowspan="1" colspan="1">Clouds</td>
<td rowspan="1" colspan="1">Not in MeSH</td> </tr> </tbody> </table> <table-wrap-foot><fn><p>True siblings in MeSH are printed in bold. The relation of the generated siblings is given relative to the location of the seed terms in MeSH. The relation ‘Unrelated’ means that the generated term is neither a sibling nor a child of a sibling, but occurs elsewhere in MeSH.</p> </fn> </table-wrap-foot> </table-wrap> </p> <p><italic>Number of seed terms</italic> : The presented algorithm can use any number of seed terms as an input. However, to attain satisfactory results, between two and four seed terms should be used. If only one seed term is used, the algorithm may find candidate siblings which fit another meaning of the input than the intended one. If five or more seed terms are used, the execution time is too long, since the number of queries grows quadratically due to querying all pairwise combinations of the seed terms. We also evaluated the same 1000 randomly selected sets with one and two seed terms. When using two seed terms, recall decreases from 79.3% to 68.2% and precision from 60.8% to 51.5% (compared with three seed terms). When only one seed term is used, recall and precision drop to 25.2% and 15.3%, respectively. This shows that using only two seed terms still produces satisfactory results, which can support ontology engineers in ontology extension.</p> <p><italic>Overfitting</italic> : One issue we had to deal with was websites that list MeSH terms, which yield very high precision and recall. However, we decided to retain these websites in the results, since they rarely occur and are difficult to filter out correctly.</p> <p><italic>Siblings in abstracts and full-text articles</italic> : We evaluated the number of siblings extracted from snippets found in MEDLINE abstracts and PubMed Central full-text articles, if one of them was part of the web search results. 
While only 59 abstracts contained siblings in our evaluation, 117 full-text articles contained enumerations which were used to generate siblings. This is especially noteworthy since MEDLINE contains 19 million abstracts [<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/pubs/factsheets/medline.html">http://www.nlm.nih.gov/pubs/factsheets/medline.html</ext-link> ], whereas PubMed Central only contains 2.3 million full-text articles. This shows that for text mining in biomedical literature, full-text articles, patent information and website contents should always be taken into consideration, and can sometimes even be a more informative resource than just MEDLINE abstracts.</p> </sec> <sec id="SEC4.2"><title>4.2 TREC ELC task</title>
<p>Since 2010, the TREC has included a task with a similar goal as part of the Entity track: Entity List Completion (ELC) (<xref ref-type="bibr" rid="B2">Balog <italic>et al</italic> ., 2011</xref> ). The task contains eight topics, each including a description of the task and a list of examples. This list should be expanded by finding entities from a given set which are in a sibling relationship to the examples. Finally, results have to be mapped to a given set of URIs. We skipped the last step since this was not in the scope of our work and should in general not decrease recall and precision. As a corpus, the English portion of the ClueWeb09 dataset, comprising ~500 million webpages, was used in the ELC task.</p> <p>To test whether our system is capable of finding siblings from the topics, we performed a simple experiment. From the provided examples of each topic, we picked three, generated siblings from them using our approach, and checked whether the results correspond to the results in the provided sets (<xref ref-type="table" rid="T3">Table 3</xref> and <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S2</ext-link> ). All topics except one are not taken from the biomedical domain. The evaluation shows that our method can find the majority of the correct siblings (R-precision: 55.3%) and is capable of finding almost all siblings (Recall: 86.7%). This shows that our system can generate siblings in any domain and is thus universal. Compared with the other contestants of the ELC task, R-precision was 24.1% better than the best result (recall was not measured), although our evaluation method differs in some points from the ELC task.
<table-wrap id="T3" position="float"><label>Table 3.</label>
<caption><p>Results from the 2010 TREC ELC task</p> </caption> <table frame="hsides" rules="groups"><thead align="left"><tr><th rowspan="1" colspan="1">Domain</th>
<th rowspan="1" colspan="1">Siblings</th>
<th rowspan="1" colspan="1">Recall (%)</th>
<th rowspan="1" colspan="1">R-precision (%)</th> </tr> </thead> <tbody align="left"><tr><td rowspan="1" colspan="1">Professional sports teams</td>
<td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1">75.0</td>
<td rowspan="1" colspan="1">62.5</td> </tr> <tr><td rowspan="1" colspan="1">Pharmaceutical products</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">100.0</td>
<td rowspan="1" colspan="1">100.0</td> </tr> <tr><td rowspan="1" colspan="1">Airlines</td>
<td rowspan="1" colspan="1">45</td>
<td rowspan="1" colspan="1">84.4</td>
<td rowspan="1" colspan="1">24.4</td> </tr> <tr><td rowspan="1" colspan="1">Companies</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">60.0</td>
<td rowspan="1" colspan="1">20.0</td> </tr> <tr><td rowspan="1" colspan="1">Airlines II</td>
<td rowspan="1" colspan="1">27</td>
<td rowspan="1" colspan="1">96.2</td>
<td rowspan="1" colspan="1">40.7</td> </tr> <tr><td rowspan="1" colspan="1">Universities</td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">100.0</td>
<td rowspan="1" colspan="1">70.0</td> </tr> <tr><td rowspan="1" colspan="1">Television Chefs</td>
<td rowspan="1" colspan="1">40</td>
<td rowspan="1" colspan="1">77.5</td>
<td rowspan="1" colspan="1">45.0</td> </tr> <tr><td rowspan="1" colspan="1">Whisky distilleries</td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">100.0</td>
<td rowspan="1" colspan="1">80.0</td> </tr> <tr><td colspan="4" rowspan="1"><hr></hr> </td> </tr> <tr><td rowspan="1" colspan="1">Average</td>
<td rowspan="1" colspan="1">18.25</td>
<td rowspan="1" colspan="1">86.7</td>
<td rowspan="1" colspan="1">55.3</td> </tr> </tbody> </table> <table-wrap-foot><fn><p>Only a subset of the available topics from the entity track was suitable for this task (a full description of the tasks is given in <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S2</ext-link> ). The column ‘Siblings’ shows the number of siblings that can be found. Recall is the percentage of found siblings. R-precision is the precision at the <italic>R</italic> -th position where <italic>R</italic> is the number of expected siblings.</p> </fn> </table-wrap-foot> </table-wrap> </p> </sec> <sec id="SEC4.3"><title>4.3 Runtime</title>
<p>Our approach works on-the-fly, meaning every time siblings are generated, the pipeline (<xref ref-type="fig" rid="F3">Fig. 3</xref> ) is re-run with the given seed terms. By generating siblings on-the-fly, the results are always up-to-date and seed terms do not have to be biomedical terminology, but can be from any domain. Even though the retrieval and extraction process is highly parallelized, generating siblings can take up to 9 s. Most of this time is spent querying the web search engines (on average 2.38 s) and retrieving websites (on average 5.92 s). However, by caching recent sibling generations, existing results are returned almost immediately.</p> </sec> <sec id="SEC4.4"><title>4.4 Ontology generation plugin</title>
<p>We integrated this work into DOG4DAG (<xref ref-type="bibr" rid="B22">Wächter and Schroeder, 2010</xref> ), our ontology generation plugin for Protégé and OBO-Edit (see <xref ref-type="fig" rid="F7">Fig. 7</xref> for a screenshot of the plugin in Protégé). Siblings can be generated by either selecting a term with at least two child terms or by manually typing the seed terms. For each sibling, the plugin automatically checks for cross-references to other biomedical ontologies using the EBI Ontology Lookup Service (<xref ref-type="bibr" rid="B5">Côté <italic>et al</italic> ., 2008</xref> ) or the BioPortal web service (<xref ref-type="bibr" rid="B25">Whetzel <italic>et al</italic> ., 2011</xref> ). This way, biocurators can identify other ontologies of interest and link their terms to them. Furthermore, generated terms are automatically mapped to terms in the currently loaded ontology to help biocurators link terms to other ontologies. Finally, already existing terms are printed in bold face, allowing the user to quickly spot them in the loaded ontology.
<fig id="F7" position="float"><label>Fig. 7.</label>
<caption><p>Screenshot of sibling generation results in the DOG4DAG plugin for Protégé</p> </caption> <graphic xlink:href="bts215f7"></graphic> </fig> </p> <p>For experimentation, we loaded the Human Disease Ontology (DO) [<ext-link ext-link-type="uri" xlink:href="http://diseaseontology.sourceforge.net">http://diseaseontology.sourceforge.net</ext-link> ] (as of 15/12/2011) into OBO-Edit and selected three child terms of <italic>genetic skin diseases</italic> (<italic>cutis laxa</italic> , <italic>Hailey–Hailey disease</italic> and <italic>Rothmund–Thomson syndrome</italic> ) as seed terms for sibling generation. The candidate siblings can be found in <xref ref-type="table" rid="T4">Table 4</xref> . Almost all of the generated terms are also genetic skin diseases. Most of them are already part of the ontology (the terms in bold face). However, many of them are simply categorized as <italic>monogenic diseases</italic> and could be added to <italic>genetic skin diseases</italic> right away. In addition, a number of candidate siblings, such as <italic>Naegeli syndrome</italic> , are not yet part of the ontology and can also be added to the parent term.
<table-wrap id="T4" position="float"><label>Table 4.</label>
<caption><p>Generated siblings for <italic>genetic skin diseases</italic> (DOID:1698) using the child terms <italic>cutis laxa</italic> (DOID:3144), <italic>Hailey–Hailey disease</italic> (DOID:0050429), and <italic>Rothmund–Thomson disease</italic> (DOID:2732) as seed terms</p> </caption> <table frame="hsides" rules="groups"><thead align="left"><tr><th rowspan="1" colspan="1">Rank</th>
<th rowspan="1" colspan="1">Generated sibling</th>
<th rowspan="1" colspan="1">Parent term</th> </tr> </thead> <tbody align="left"><tr><td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1"><bold>Cockayne syndrome</bold> </td> <td rowspan="1" colspan="1"><italic>Monogenic diseases</italic> </td> </tr> <tr><td rowspan="1" colspan="1">2</td>
<td rowspan="1" colspan="1"><bold>Xeroderma pigmentosum</bold> </td> <td rowspan="1" colspan="1"><italic>Monogenic diseases</italic> </td> </tr> <tr><td rowspan="1" colspan="1">3</td>
<td rowspan="1" colspan="1"><bold>Hypohidrotic ectodermal dysplasia</bold> </td> <td rowspan="1" colspan="1"><italic>Monogenic diseases</italic> </td> </tr> <tr><td rowspan="1" colspan="1">4</td>
<td rowspan="1" colspan="1"><bold>Incontinentia pigmenti</bold> </td> <td rowspan="1" colspan="1"><italic>Genetic skin disease</italic> </td> </tr> <tr><td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1"><bold>Dyskeratosis congenita</bold> </td> <td rowspan="1" colspan="1"><italic>Genetic skin disease</italic> </td> </tr> <tr><td rowspan="1" colspan="1">6</td>
<td rowspan="1" colspan="1"><bold>Erythrokeratodermia variabilis</bold> </td> <td rowspan="1" colspan="1"><italic>Genetic skin disease</italic> </td> </tr> <tr><td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1"><bold>Clouston syndrome</bold> </td> <td rowspan="1" colspan="1"><italic>Monogenic disease</italic> </td> </tr> <tr><td rowspan="1" colspan="1">8</td>
<td rowspan="1" colspan="1"><bold>Proteus syndrome</bold> </td> <td rowspan="1" colspan="1"><italic>Physical disorder</italic> </td> </tr> <tr><td rowspan="1" colspan="1">9</td>
<td rowspan="1" colspan="1"><bold>Erythropoietic Protoporphyria</bold> </td> <td rowspan="1" colspan="1"><italic>Acute porphyria</italic> </td> </tr> <tr><td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">Naegeli syndrome</td>
<td rowspan="1" colspan="1">Not yet part of DO</td> </tr> </tbody> </table> <table-wrap-foot><fn><p>Most of the terms are already part of the loaded human disease ontology (printed in bold). However, <italic>Naegeli syndrome</italic> is a term that can be added to <italic>genetic skin diseases</italic> .</p> </fn> </table-wrap-foot> </table-wrap> </p> </sec> </sec> <sec id="SEC5"><title>5 DISCUSSION</title>
<sec id="SEC5.1"><title>5.1 Text-based versus structure-based approach</title>
<p>First, we will look at the results of the two approaches and discuss individual advantages and disadvantages.</p>
<p>When examining the quantity of sibling candidates alone, the structure-based approach yields more results, because in contrast to the text-based approach it requires only that seed terms occur on the same web page, but not necessarily close to each other. Even very distant terms can be semantically related, like headings separated by multiple paragraphs. As long as the headings are on the same path in the parse tree, they will be discovered. On the other hand, this also leads to false positives, since not all headings on the same path are necessarily semantically related. In contrast, the search queries in the text-based approach include the introductory phrase (except for the most generic search pattern) and thus potentially find fewer results.</p>
<p>The structure-based approach has a number of other advantages. First, it works on arbitrary HTML documents, meaning almost all of the web search results can be utilized. Additionally, being able to exploit the structure of a document also means that this approach works independently of the language.</p>
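<p>The same-parent, same-tag idea behind the structure-based approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class <monospace>PathCollector</monospace>, the function <monospace>candidate_siblings</monospace> and the example page are hypothetical names introduced here, and the sketch assumes that all seed terms must appear as direct text of nodes sharing one parent and one tag.</p>
<preformat><![CDATA[
```python
from html.parser import HTMLParser

# Elements without a closing tag; they never become parents of text.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record each text node with its own tag and the id of its parent node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, node_id) of the currently open elements
        self.counter = 0   # running id so that equal tags stay distinguishable
        self.records = []  # (parent_id, tag, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.counter += 1
        self.stack.append((tag, self.counter))

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close the innermost matching
        # element together with anything left open inside it.
        if tag in (open_tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag = self.stack[-1][0]
            parent_id = self.stack[-2][1] if len(self.stack) > 1 else 0
            self.records.append((parent_id, tag, text))


def candidate_siblings(html, seeds):
    """Return texts that share parent node and tag with all seed terms."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    wanted = {seed.lower() for seed in seeds}
    groups = {}
    for parent_id, tag, text in parser.records:
        groups.setdefault((parent_id, tag), []).append(text)
    found = []
    for texts in groups.values():
        if wanted.issubset({t.lower() for t in texts}):
            found.extend(t for t in texts if t.lower() not in wanted)
    return found


# Hypothetical page used only for illustration.
EXAMPLE_PAGE = (
    "<html><body><h1>Skin diseases</h1>"
    "<ul><li>Clouston syndrome</li><li>Proteus syndrome</li>"
    "<li>Naegeli syndrome</li></ul>"
    "<p>Unrelated paragraph</p></body></html>"
)

if __name__ == "__main__":
    print(candidate_siblings(EXAMPLE_PAGE, ["Clouston syndrome", "Proteus syndrome"]))
```
]]></preformat>
<p>For the example page, the two seed terms sit in list items under the same list, so the remaining list item is returned as a candidate, while the heading and the paragraph are ignored. The sketch inspects only tags and tree positions, never the words themselves, which is why the idea carries over to pages in any language.</p>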
<p>The second, text-based approach is based on an entirely different idea. Here, siblings are generated by finding enumerations in sentences and extracting the individual terms and the head term (if available). Regular expressions can match these patterns in sentences with great variability. Because the regular expressions are generated automatically from the annotated corpus, we avoid the errors and omissions that arise when such expressions are written by hand. The expressions have been optimized for biomedical and chemical terms. For instance, they allow non-ASCII characters, punctuation inside terms (e.g. ‘1,3-Butadiene’), and multi-word terms.</p>
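<p>As a rough illustration of such an enumeration pattern, the following Python sketch matches a single ‘<italic>head term</italic> such as A, B and C’ construction. It is an assumption-laden example rather than one of the generated expressions: the names <monospace>PATTERN</monospace>, <monospace>SEPARATOR</monospace> and <monospace>extract_enumeration</monospace> are hypothetical, the input is assumed to be one sentence, and the head term is taken naively as everything before ‘such as’ instead of being identified through NP tagging.</p>
<preformat><![CDATA[
```python
import re

# Items are separated by a comma followed by whitespace, or by "and"/"or".
# A comma without following whitespace (as in "1,3-Butadiene") is not a
# separator, so punctuation inside a term survives.
SEPARATOR = re.compile(r",?\s+(?:and|or)\s+|,\s+")

# Group 1: head term (word characters, hyphens, spaces; in Python 3 the
# class \w also covers non-ASCII letters). Group 2: the raw enumeration up
# to an optional final period at the end of the sentence.
PATTERN = re.compile(r"([\w\- ]+?),?\s+such as\s+(.+?)\.?$")


def extract_enumeration(sentence):
    """Return (head term, list of items), or None if the pattern is absent."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    head = match.group(1).strip()
    items = [part.strip() for part in SEPARATOR.split(match.group(2)) if part.strip()]
    return head, items


if __name__ == "__main__":
    print(extract_enumeration(
        "Volatile organic compounds such as 1,3-Butadiene, benzene and toluene."
    ))
```
]]></preformat>
<p>Splitting on separators rather than enumerating allowed term characters keeps multi-word and non-ASCII terms intact. A term that itself contains ‘and’ or ‘or’ would be split incorrectly by this sketch.</p>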
<p>That the text-based approach finds fewer results is inherent to pattern-based extraction, which usually yields low recall (<xref ref-type="bibr" rid="B10">Hearst, 1992</xref> ). It also fits the observations of <xref ref-type="bibr" rid="B8">Etzioni <italic>et al</italic> . (2005</xref> ): their structure-based ‘List Extractor’ component finds about five times more results than the text-based approach. Contrary to our initial assumption, the precision of the text-based approach is equal to or lower than that of the structure-based approach in the MeSH evaluation (<xref ref-type="fig" rid="F5">Fig. 5</xref> ). This is mainly due to the format of the retrieved snippets, which often contain truncated phrases and parts of sentences, making POS tagging and subsequent extraction of the enumerated terms harder.</p> <p>Both approaches have a significant overlap in terms of generated siblings. This shows that each of them generates correct results independently. Nonetheless, each approach generates siblings which the other method does not discover.</p> </sec> <sec id="SEC5.2"><title>5.2 Assessment of the completeness of ontologies</title>
<p>While there exist guidelines and tools that help to assess or even ensure the technical quality or consistency of a domain ontology (e.g. <xref ref-type="bibr" rid="B26">Yao <italic>et al</italic> ., 2011</xref> ), it is much harder to determine whether or not an ontology covers all aspects of the domain; hence, it is hard to judge its <italic>completeness</italic> . With the help of our set expansion method for sibling discovery, we are able to provide some judgement by comparing the generated siblings with those already existing in the ontology.</p> <p><italic>Overall completeness of MeSH</italic> : Considering the evaluation for MeSH in <xref ref-type="sec" rid="SEC4.1">Section 4.1</xref> , the generated siblings for the 1000 random sibling sets can be divided into five categories, which are listed in <xref ref-type="table" rid="T5">Table 5</xref> . Almost 50% of the terms (47.2%) are generated correctly as a sibling or descendant of a seed term; for these terms, the automatic method agrees with our gold standard MeSH. Another 30.1% of the generated siblings occur in MeSH but not as siblings, and a further 13.4% are not present in MeSH but exist as term labels within the UMLS Metathesaurus [The UMLS Metathesaurus (<xref ref-type="bibr" rid="B3">Bodenreider, 2004</xref> ) is a collection of controlled vocabularies in the biomedical domain (including MeSH) and currently contains over 1 000 000 terms in total]. In summary, <xref ref-type="table" rid="T5">Table 5</xref> shows that over 75% of the generated terms (77.3%) are part of MeSH, which indicates that MeSH is for the most part complete with regard to its term base. We also manually evaluated a sample of 100 generated sibling terms that were not part of MeSH but can be found in the UMLS Metathesaurus. 
Of these terms, 52% were found to be true siblings of the seed terms (<xref ref-type="table" rid="T6">Table 6</xref> and <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S3</ext-link> ) and are as such good candidates to be added to MeSH in the future.
<table-wrap id="T5" position="float"><label>Table 5.</label>
<caption><p>Distribution of the generated terms in MeSH and the UMLS Metathesaurus</p> </caption> <table frame="hsides" rules="groups"><thead align="left"><tr><th rowspan="1" colspan="1">Category</th>
<th rowspan="1" colspan="1">Percentage (%)</th> </tr> </thead> <tbody align="left"><tr><td rowspan="1" colspan="1">Sibling of seed term in MeSH</td>
<td rowspan="1" colspan="1">40.7</td> </tr> <tr><td rowspan="1" colspan="1">Descendant of seed term in MeSH</td>
<td rowspan="1" colspan="1">6.5</td> </tr> <tr><td rowspan="1" colspan="1">Occurs elsewhere in MeSH</td>
<td rowspan="1" colspan="1">30.1</td> </tr> <tr><td rowspan="1" colspan="1">Occurs in UMLS, but not in MeSH</td>
<td rowspan="1" colspan="1">13.4</td> </tr> <tr><td rowspan="1" colspan="1">Not found in UMLS</td>
<td rowspan="1" colspan="1">9.3</td> </tr> </tbody> </table> <table-wrap-foot><fn><p>In all, 47.2% of the terms are highly relevant and 90.7% are correct biomedical terminology. (Please note that the top 10 results are taken into account, no matter how many siblings the seed terms have in MeSH.)</p> </fn> </table-wrap-foot> </table-wrap> <table-wrap id="T6" position="float"><label>Table 6.</label>
<caption><p>Manual evaluation of the terms from the categories ‘Occurs elsewhere in MeSH’ and ‘Occurs in UMLS, but not in MeSH’</p> </caption> <table frame="hsides" rules="groups"><thead align="left"><tr><th rowspan="1" colspan="1">Category</th>
<th rowspan="1" colspan="1">True siblings (%)</th>
<th rowspan="1" colspan="1">Related siblings (%)</th>
<th rowspan="1" colspan="1">False siblings (%)</th> </tr> </thead> <tbody align="left"><tr><td rowspan="1" colspan="1">Occurs elsewhere in MeSH</td>
<td rowspan="1" colspan="1">16</td>
<td rowspan="1" colspan="1">47</td>
<td rowspan="1" colspan="1">37</td> </tr> <tr><td rowspan="1" colspan="1">Occurs in UMLS, but not in MeSH</td>
<td rowspan="1" colspan="1">52</td>
<td rowspan="1" colspan="1">16</td>
<td rowspan="1" colspan="1">32</td> </tr> </tbody> </table> <table-wrap-foot><fn><p><italic>True siblings</italic> are generated terms that can be added to existing seed terms. <italic>Related siblings</italic> are terms with a similar subject, but are not true siblings of the seed terms. <italic>False siblings</italic> are terms which are not related to the seed terms.</p> </fn> </table-wrap-foot> </table-wrap> </p> <p><italic>Consistency of MeSH</italic> : Finally, we also evaluated a sample of 100 generated siblings which were not siblings of the seed terms but occurred at a different position within MeSH (see ‘Occurs elsewhere in MeSH’ in <xref ref-type="table" rid="T6">Table 6</xref> and <ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/bts215/DC1">Supplementary Table S4</ext-link> ). Only 16% were found to be true siblings (<xref ref-type="table" rid="T6">Table 6</xref> ). This indicates that MeSH is for the most part modelled correctly with regard to the location of its siblings.</p> <p>Both results demonstrate that sibling generation is a powerful tool for assessing the completeness of ontologies.</p> </sec> <sec id="SEC5.3"><title>5.3 Adherence to ontology design criteria and naming conventions</title>
<p>Additionally, we examined the generated siblings regarding criteria for ontology design and naming conventions. In <xref ref-type="bibr" rid="B18">Schober <italic>et al</italic> . (2009</xref> ), conventions for OBO Foundry ontologies were presented. The conventions support unified ontology development and help developers avoid mistakes when working on ontologies. Overall, the generated siblings adhere to the proposed design guidelines and naming conventions. For instance, new terms should incorporate the genus-differentia style for names. Since we prefer candidate siblings containing this style, these are ranked higher in the results. Another convention is that acronyms should be expanded. Because we included an abbreviation tagger in the processing pipeline (<xref ref-type="fig" rid="F3">Fig. 3</xref> ), they are automatically expanded, if possible. Finally, since the text-based approach only allows NPs as candidate siblings, we avoid the use of conjunctions.</p> </sec> <sec id="SEC5.4"><title>5.4 Limitations and future work</title>
<p>Generally, our work is based on the assumption that terms can in principle be found in text and that the web is representative of the domain to be modelled by the ontology. Although we developed our approach to be as generic as possible, some limitations are nonetheless inherent.</p>
<p>First, we can only generate siblings for natural language terms which are semantically related and discussed in the literature or on websites in general. Furthermore, we do not take the specific relationship type and synonymous terms into account.</p>
<p>Overall, the system works best for completing a set of terms with the same semantic type. However, it cannot explicitly recognize the context of the given seed terms. We will consider incorporating contextual information in the query to increase the precision of the generated siblings. Additionally, we will work on improving the recall of the text-based approach by two means. First, if the retrieved snippet is not a full sentence, we will fetch the whole web page and extend the existing snippet to complete the sentence. Second, we will extend the number of patterns for text-based sibling discovery using a bootstrapped pattern learning approach, similar to the ones presented in <xref ref-type="bibr" rid="B8">Etzioni <italic>et al</italic> . (2005</xref> ) and <xref ref-type="bibr" rid="B12">Kozareva <italic>et al</italic> . (2008</xref> ). We also plan to further improve the scalability of sibling generation when more than four seed terms are used. Finally, we will investigate whether using a higher number of seed terms can effectively improve the precision of the retrieved candidate siblings.</p> </sec> </sec> <sec id="SEC6"><title>6 CONCLUSION</title>
<p>In this work, we presented an approach to extend ontologies systematically by finding new terms similar to two or three provided terms. We combined two very different methods and used a simple rank aggregation strategy to combine the results. By taking the peculiarities of biomedical terminology into consideration, we used hypernym matching and compound term matching to improve the ranking of terms which fulfill these criteria.</p>
<p>The evaluation using MeSH shows that our approach can successfully support ontology engineers by semi-automatically completing existing sets of siblings. Additionally, our approach can also serve as a first step towards evaluating the completeness of ontologies. We showed that MeSH covers the biomedical domain well, as the vast majority of siblings suggested by the method are already contained in it. Nonetheless, a significant number of good candidates for incorporation could be suggested with high precision. Furthermore, our evaluation suggests that text mining in the biomedical domain gains significantly from full-text resources such as PubMed Central.</p>
<p>In particular, when evaluating set expansion for sibling discovery using 1000 randomly selected term sets from MeSH, our approach finds 79.3% of the existing siblings in the sets using three seed terms. Both methods contribute to the results. However, the structure-based approach finds slightly more true positives than the text-based approach. When only two seed terms are used, the method still produces satisfactory results, but recall and precision drop to 68.2% and 51.5%, respectively. The generated terms fulfill ontology naming conventions and need no post-editing.</p>
<p>Our method is universal, meaning the system allows sibling generation for any domain, as shown by the evaluation using the TREC ELC task, where 86.7% of the siblings were discovered with a precision of 55.3%. Additionally, the method can in principle generate siblings for any language, since the structure-based approach works independently of the language.</p>
<p>Since this work is integrated into the DOG4DAG plugin, ontologies of all common formats can be extended seamlessly in Protégé and OBO-Edit and generated terms are cross-referenced to other biomedical ontologies.</p>
<p><italic>Funding</italic> : <funding-source>PONTE and GeneCloud</funding-source> .</p> <p><italic>Conflict of Interest</italic> : none declared.</p> </sec> <sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material id="PMC_1" content-type="local-data"><caption><title>Supplementary Data</title> </caption> <media mimetype="text" mime-subtype="html" xlink:href="supp_28_12_i292__index.html"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_1.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_2.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_3.pdf"></media>
<media xlink:role="associated-file" mimetype="application" mime-subtype="pdf" xlink:href="supp_bts215_Fabian_32_sup_4.pdf"></media> </supplementary-material> </sec> </body><back><ref-list><title>REFERENCES</title>
<ref id="B1"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ashburner</surname>
<given-names>M.</given-names> </name> <etal></etal> </person-group> <article-title>Gene Ontology: tool for the unification of biology</article-title>
<source>Nat. Genet.</source>
<year>2000</year>
<volume>25</volume>
<fpage>25</fpage>
<lpage>29</lpage>
<pub-id pub-id-type="pmid">10802651</pub-id> </element-citation> </ref> <ref id="B2"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Balog</surname>
<given-names>K.</given-names> </name> <etal></etal> </person-group> <article-title>Overview of the TREC 2010 entity track</article-title>
<source>Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010)</source>
<year>2011</year>
<publisher-loc>Gaithersburg, Maryland, USA</publisher-loc> </element-citation> </ref> <ref id="B3"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bodenreider</surname>
<given-names>O.</given-names> </name> </person-group> <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>D267</fpage>
<lpage>D270</lpage>
<pub-id pub-id-type="pmid">14681409</pub-id> </element-citation> </ref> <ref id="B4"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Brunzel</surname>
<given-names>M.</given-names> </name> <name><surname>Spiliopoulou</surname>
<given-names>M.</given-names> </name> </person-group> <article-title>Discovering Multi Terms and Co-hyponymy from XHTML Documents with XTREEM</article-title>
<source>Knowledge Discovery from XML Documents.</source>
<year>2006</year>
<volume>3915</volume>
<publisher-loc>Berlin/Heidelberg</publisher-loc>
<publisher-name>Springer</publisher-name>
<fpage>22</fpage>
<lpage>32</lpage>
<comment>Lecture Notes in Computer Science doi: 10.1007/11730262_5</comment> </element-citation> </ref> <ref id="B5"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Côté</surname>
<given-names>R.G.</given-names> </name> <etal></etal> </person-group> <article-title>The Ontology Lookup Service: more data and better tools for controlled vocabulary queries</article-title>
<source>Nucleic Acids Res.</source>
<year>2008</year>
<volume>36</volume>
<fpage>W372</fpage>
<lpage>W376</lpage>
<pub-id pub-id-type="pmid">18467421</pub-id> </element-citation> </ref> <ref id="B6"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Day-Richter</surname>
<given-names>J.</given-names> </name> <etal></etal> </person-group> <article-title>OBO-Edit–an ontology editor for biologists</article-title>
<source>Bioinformatics</source>
<year>2007</year>
<volume>23</volume>
<fpage>2198</fpage>
<lpage>2200</lpage>
<pub-id pub-id-type="pmid">17545183</pub-id> </element-citation> </ref> <ref id="B7"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Doms</surname>
<given-names>A.</given-names> </name> <name><surname>Schroeder</surname>
<given-names>M.</given-names> </name> </person-group> <article-title>GoPubMed: exploring PubMed with the Gene Ontology</article-title>
<source>Nucleic Acids Res.</source>
<year>2005</year>
<volume>33</volume>
<fpage>W783</fpage>
<lpage>W786</lpage>
<pub-id pub-id-type="pmid">15980585</pub-id> </element-citation> </ref> <ref id="B8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Etzioni</surname>
<given-names>O.</given-names> </name> <etal></etal> </person-group> <article-title>Unsupervised named-entity extraction from the Web: an experimental study</article-title>
<source>Artif. Intell.</source>
<year>2005</year>
<volume>165</volume>
<fpage>91</fpage>
<lpage>134</lpage> </element-citation> </ref> <ref id="B9"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Frantzi</surname>
<given-names>K.</given-names> </name> <etal></etal> </person-group> <article-title>Automatic recognition of multi-word terms: the C-value/NC-value Method</article-title>
<source>Int. J. Digit. Libr.</source>
<year>2000</year>
<volume>3</volume>
<fpage>115</fpage>
<lpage>130</lpage> </element-citation> </ref> <ref id="B10"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Hearst</surname>
<given-names>M.</given-names> </name> </person-group> <article-title>Automatic acquisition of hyponyms from large text corpora</article-title>
<source>Proceedings of the 14th Conference on Computational Linguistics.</source>
<year>1992</year>
<volume>2</volume>
<publisher-loc>Nantes, France</publisher-loc>
<fpage>539</fpage>
<lpage>545</lpage> </element-citation> </ref> <ref id="B11"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Howe</surname>
<given-names>D.</given-names> </name> <etal></etal> </person-group> <article-title>Big data: the future of biocuration</article-title>
<source>Nature</source>
<year>2008</year>
<volume>455</volume>
<fpage>47</fpage>
<lpage>50</lpage>
<pub-id pub-id-type="pmid">18769432</pub-id> </element-citation> </ref> <ref id="B12"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Kozareva</surname>
<given-names>Z.</given-names> </name> <etal></etal> </person-group> <article-title>Semantic class learning from the web with hyponym pattern linkage graphs</article-title>
<source>Proceedings of ACL-08: HLT</source>
<year>2008</year>
<publisher-loc>Columbus, OH, USA</publisher-loc>
<fpage>1048</fpage>
<lpage>1056</lpage> </element-citation> </ref> <ref id="B13"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Lin</surname>
<given-names>D.</given-names> </name> <etal></etal> </person-group> <article-title>Induction of semantic classes from natural language text</article-title>
<source>Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
<year>2001</year>
<publisher-loc>San Francisco, CA, USA</publisher-loc>
<fpage>317</fpage>
<lpage>322</lpage> </element-citation> </ref> <ref id="B14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname>
<given-names>K.</given-names> </name> <etal></etal> </person-group> <article-title>Natural language processing methods and systems for biomedical ontology learning</article-title>
<source>J. Biomed. Inform.</source>
<year>2011</year>
<volume>44</volume>
<fpage>163</fpage>
<lpage>179</lpage>
<pub-id pub-id-type="pmid">20647054</pub-id> </element-citation> </ref> <ref id="B15"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Ogren</surname>
<given-names>P.V.</given-names> </name> <etal></etal> </person-group> <article-title>The compositional structure of gene ontology terms</article-title>
<source>Pacific Symposium on Biocomputing</source>
<year>2004</year>
<publisher-loc>Hawaii, USA</publisher-loc>
<fpage>214</fpage>
<lpage>225</lpage> </element-citation> </ref> <ref id="B16"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Pantel</surname>
<given-names>P.</given-names> </name> <etal></etal> </person-group> <article-title>Web-scale distributional similarity and entity set expansion</article-title>
<source>Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</source>
<year>2009</year>
<publisher-loc>Singapore</publisher-loc>
<fpage>938</fpage>
<lpage>947</lpage> </element-citation> </ref> <ref id="B17"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Paşca</surname>
<given-names>M.</given-names> </name> </person-group> <article-title>Acquisition of categorized named entities for web search</article-title>
<source>Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management</source>
<year>2004</year>
<publisher-loc>Washington, DC, USA</publisher-loc>
<fpage>137</fpage>
<lpage>145</lpage> </element-citation> </ref> <ref id="B18"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Schober</surname>
<given-names>D.</given-names> </name> <etal></etal> </person-group> <article-title>Survey-based naming conventions for use in OBO Foundry ontology development</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<fpage>125</fpage>
<pub-id pub-id-type="pmid">19397794</pub-id> </element-citation> </ref> <ref id="B19"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Shi</surname>
<given-names>S.</given-names> </name> <etal></etal> </person-group> <article-title>Pattern-based semantic class discovery with multi-membership support</article-title>
<source>Proceeding of the 17th ACM Conference on Information and Knowledge Management</source>
<year>2008</year>
<publisher-loc>Napa Valley, California, USA</publisher-loc>
<fpage>1453</fpage>
<lpage>1454</lpage> </element-citation> </ref> <ref id="B20"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Shi</surname>
<given-names>S.</given-names> </name> <etal></etal> </person-group> <article-title>Corpus-based semantic class mining: distributional vs. pattern-based approaches</article-title>
<source>Proceedings of the 23rd International Conference on Computational Linguistics</source>
<year>2010</year>
<publisher-loc>Beijing, China</publisher-loc>
<fpage>993</fpage>
<lpage>1001</lpage> </element-citation> </ref> <ref id="B21"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Shinzato</surname>
<given-names>K.</given-names> </name> <etal></etal> </person-group> <article-title>Acquiring hyponymy relations from web documents</article-title>
<source>Proc. HLT-NAACL</source>
<year>2004</year>
<volume>2004</volume>
<fpage>73</fpage>
<lpage>80</lpage> </element-citation> </ref> <ref id="B22"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wächter</surname>
<given-names>T.</given-names> </name> <name><surname>Schroeder</surname>
<given-names>M.</given-names> </name> </person-group> <article-title>Semi-automated ontology generation within OBO-Edit</article-title>
<source>Bioinformatics</source>
<year>2010</year>
<volume>26</volume>
<fpage>i88</fpage>
<lpage>i96</lpage>
<pub-id pub-id-type="pmid">20529942</pub-id> </element-citation> </ref> <ref id="B23"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname>
<given-names>R.</given-names> </name> <name><surname>Cohen</surname>
<given-names>W.</given-names> </name> </person-group> <article-title>Language-independent set expansion of named entities using the web</article-title>
<source>2007 Seventh IEEE International Conference on Data Mining</source>
<year>2007</year>
<fpage>342</fpage>
<lpage>350</lpage> </element-citation> </ref> <ref id="B24"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Whetzel</surname>
<given-names>P.L.</given-names> </name> <etal></etal> </person-group> <article-title>The MGED Ontology: a resource for semantics-based description of microarray experiments</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>866</fpage>
<lpage>873</lpage>
<pub-id pub-id-type="pmid">16428806</pub-id> </element-citation> </ref> <ref id="B25"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Whetzel</surname>
<given-names>P.L.</given-names> </name> <etal></etal> </person-group> <article-title>BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications</article-title>
<source>Nucleic Acids Res.</source>
<year>2011</year>
<volume>39</volume>
<fpage>W541</fpage>
<lpage>W545</lpage>
<pub-id pub-id-type="pmid">21672956</pub-id> </element-citation> </ref> <ref id="B26"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yao</surname>
<given-names>L.</given-names> </name> <etal></etal> </person-group> <article-title>Benchmarking ontologies: bigger or better?</article-title>
<source>PLoS Comput. Biol.</source>
<year>2011</year>
<volume>7</volume>
<fpage>e1001055</fpage>
<pub-id pub-id-type="pmid">21249231</pub-id> </element-citation> </ref> <ref id="B27"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname>
<given-names>H.</given-names> </name> <etal></etal> </person-group> <article-title>Employing topic models for pattern-based semantic class discovery</article-title>
<source>Proceedings of ACL/AFNLP 2009</source>
<year>2009</year>
<fpage>459</fpage>
<lpage>467</lpage> </element-citation> </ref> </ref-list> </back> </pmc></record>