Exploration server on open access in Belgium

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Text-mining assisted regulatory annotation

Internal identifier: 000239 (Pmc/Corpus); previous: 000238; next: 000240

Authors: Stein Aerts; Maximilian Haeussler; Steven Van Vooren; Obi L. Griffith; Paco Hulpiau; Steven Jm Jones; Stephen B. Montgomery; Casey M. Bergman

Source:

RBID : PMC:2374703

Abstract

Text-mining technologies can be integrated with genome annotation systems, increasing the availability of annotated cis-regulatory data.


Url:
DOI: 10.1186/gb-2008-9-2-r31
PubMed: 18271954
PubMed Central: 2374703

Links to Exploration step

PMC:2374703

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Text-mining assisted regulatory annotation</title>
<author>
<name sortKey="Aerts, Stein" sort="Aerts, Stein" uniqKey="Aerts S" first="Stein" last="Aerts">Stein Aerts</name>
<affiliation>
<nlm:aff id="I1">Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, Leuven, B-3000, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Department of Human Genetics, Katholieke Universiteit Leuven School of Medicine, Herestraat, Leuven, B-3000, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Haeussler, Maximilian" sort="Haeussler, Maximilian" uniqKey="Haeussler M" first="Maximilian" last="Haeussler">Maximilian Haeussler</name>
<affiliation>
<nlm:aff id="I3">Institut de Neurosciences A Fessard, Centre National de la Recherche Scientifique, Gif-sur-Yvette, 91 198, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Van Vooren, Steven" sort="Van Vooren, Steven" uniqKey="Van Vooren S" first="Steven" last="Van Vooren">Steven Van Vooren</name>
<affiliation>
<nlm:aff id="I4">Department of Electrical Engineering, Katholieke Universiteit Leuven, Heverlee, B-3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Griffith, Obi L" sort="Griffith, Obi L" uniqKey="Griffith O" first="Obi L" last="Griffith">Obi L. Griffith</name>
<affiliation>
<nlm:aff id="I5">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hulpiau, Paco" sort="Hulpiau, Paco" uniqKey="Hulpiau P" first="Paco" last="Hulpiau">Paco Hulpiau</name>
<affiliation>
<nlm:aff id="I6">VIB Department for Molecular Biomedical Research, Ghent University, Ghent, 9052, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jones, Steven Jm" sort="Jones, Steven Jm" uniqKey="Jones S" first="Steven Jm" last="Jones">Steven Jm Jones</name>
<affiliation>
<nlm:aff id="I5">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Montgomery, Stephen B" sort="Montgomery, Stephen B" uniqKey="Montgomery S" first="Stephen B" last="Montgomery">Stephen B. Montgomery</name>
<affiliation>
<nlm:aff id="I7">Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bergman, Casey M" sort="Bergman, Casey M" uniqKey="Bergman C" first="Casey M" last="Bergman">Casey M. Bergman</name>
<affiliation>
<nlm:aff id="I8">Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester, M13 9PT, UK</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">18271954</idno>
<idno type="pmc">2374703</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2374703</idno>
<idno type="RBID">PMC:2374703</idno>
<idno type="doi">10.1186/gb-2008-9-2-r31</idno>
<date when="2008">2008</date>
<idno type="wicri:Area/Pmc/Corpus">000239</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Text-mining assisted regulatory annotation</title>
<author>
<name sortKey="Aerts, Stein" sort="Aerts, Stein" uniqKey="Aerts S" first="Stein" last="Aerts">Stein Aerts</name>
<affiliation>
<nlm:aff id="I1">Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, Leuven, B-3000, Belgium</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="I2">Department of Human Genetics, Katholieke Universiteit Leuven School of Medicine, Herestraat, Leuven, B-3000, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Haeussler, Maximilian" sort="Haeussler, Maximilian" uniqKey="Haeussler M" first="Maximilian" last="Haeussler">Maximilian Haeussler</name>
<affiliation>
<nlm:aff id="I3">Institut de Neurosciences A Fessard, Centre National de la Recherche Scientifique, Gif-sur-Yvette, 91 198, France</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Van Vooren, Steven" sort="Van Vooren, Steven" uniqKey="Van Vooren S" first="Steven" last="Van Vooren">Steven Van Vooren</name>
<affiliation>
<nlm:aff id="I4">Department of Electrical Engineering, Katholieke Universiteit Leuven, Heverlee, B-3001, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Griffith, Obi L" sort="Griffith, Obi L" uniqKey="Griffith O" first="Obi L" last="Griffith">Obi L. Griffith</name>
<affiliation>
<nlm:aff id="I5">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Hulpiau, Paco" sort="Hulpiau, Paco" uniqKey="Hulpiau P" first="Paco" last="Hulpiau">Paco Hulpiau</name>
<affiliation>
<nlm:aff id="I6">VIB Department for Molecular Biomedical Research, Ghent University, Ghent, 9052, Belgium</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Jones, Steven Jm" sort="Jones, Steven Jm" uniqKey="Jones S" first="Steven Jm" last="Jones">Steven Jm Jones</name>
<affiliation>
<nlm:aff id="I5">Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Montgomery, Stephen B" sort="Montgomery, Stephen B" uniqKey="Montgomery S" first="Stephen B" last="Montgomery">Stephen B. Montgomery</name>
<affiliation>
<nlm:aff id="I7">Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bergman, Casey M" sort="Bergman, Casey M" uniqKey="Bergman C" first="Casey M" last="Bergman">Casey M. Bergman</name>
<affiliation>
<nlm:aff id="I8">Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester, M13 9PT, UK</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Genome Biology</title>
<idno type="ISSN">1465-6906</idno>
<idno type="eISSN">1465-6914</idno>
<imprint>
<date when="2008">2008</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Text-mining technologies can be integrated with genome annotation systems, increasing the availability of annotated
<italic>cis</italic>
-regulatory data.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Genome Biol</journal-id>
<journal-title>Genome Biology</journal-title>
<issn pub-type="ppub">1465-6906</issn>
<issn pub-type="epub">1465-6914</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">18271954</article-id>
<article-id pub-id-type="pmc">2374703</article-id>
<article-id pub-id-type="publisher-id">gb-2008-9-2-r31</article-id>
<article-id pub-id-type="doi">10.1186/gb-2008-9-2-r31</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Text-mining assisted regulatory annotation</article-title>
</title-group>
<contrib-group>
<contrib id="A1" corresp="yes" contrib-type="author">
<name>
<surname>Aerts</surname>
<given-names>Stein</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<xref ref-type="aff" rid="I2">2</xref>
<email>stein.aerts@med.kuleuven.be</email>
</contrib>
<contrib id="A2" contrib-type="author">
<name>
<surname>Haeussler</surname>
<given-names>Maximilian</given-names>
</name>
<xref ref-type="aff" rid="I3">3</xref>
<email>maximilianh@gmail.com</email>
</contrib>
<contrib id="A3" contrib-type="author">
<name>
<surname>van Vooren</surname>
<given-names>Steven</given-names>
</name>
<xref ref-type="aff" rid="I4">4</xref>
<email>Steven.VanVooren@esat.kuleuven.ac.be</email>
</contrib>
<contrib id="A4" contrib-type="author">
<name>
<surname>Griffith</surname>
<given-names>Obi L</given-names>
</name>
<xref ref-type="aff" rid="I5">5</xref>
<email>obig@bcgsc.ca</email>
</contrib>
<contrib id="A5" contrib-type="author">
<name>
<surname>Hulpiau</surname>
<given-names>Paco</given-names>
</name>
<xref ref-type="aff" rid="I6">6</xref>
<email>paco.hulpiau@dmbr.ugent.be</email>
</contrib>
<contrib id="A6" contrib-type="author">
<name>
<surname>Jones</surname>
<given-names>Steven JM</given-names>
</name>
<xref ref-type="aff" rid="I5">5</xref>
<email>sjones@bcgsc.ca</email>
</contrib>
<contrib id="A7" contrib-type="author">
<name>
<surname>Montgomery</surname>
<given-names>Stephen B</given-names>
</name>
<xref ref-type="aff" rid="I7">7</xref>
<email>sm8@sanger.ac.uk</email>
</contrib>
<contrib id="A8" corresp="yes" contrib-type="author">
<name>
<surname>Bergman</surname>
<given-names>Casey M</given-names>
</name>
<xref ref-type="aff" rid="I8">8</xref>
<email>casey.bergman@manchester.ac.uk</email>
</contrib>
<contrib id="A9" contrib-type="author">
<collab>The Open Regulatory Annotation Consortium</collab>
<email>oreganno@noaddress.com</email>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
Laboratory of Neurogenetics, Department of Molecular and Developmental Genetics, VIB, Leuven, B-3000, Belgium</aff>
<aff id="I2">
<label>2</label>
Department of Human Genetics, Katholieke Universiteit Leuven School of Medicine, Herestraat, Leuven, B-3000, Belgium</aff>
<aff id="I3">
<label>3</label>
Institut de Neurosciences A Fessard, Centre National de la Recherche Scientifique, Gif-sur-Yvette, 91 198, France</aff>
<aff id="I4">
<label>4</label>
Department of Electrical Engineering, Katholieke Universiteit Leuven, Heverlee, B-3001, Belgium</aff>
<aff id="I5">
<label>5</label>
Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, V5Z 4E6, Canada</aff>
<aff id="I6">
<label>6</label>
VIB Department for Molecular Biomedical Research, Ghent University, Ghent, 9052, Belgium</aff>
<aff id="I7">
<label>7</label>
Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK</aff>
<aff id="I8">
<label>8</label>
Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester, M13 9PT, UK</aff>
<pub-date pub-type="ppub">
<year>2008</year>
</pub-date>
<pub-date pub-type="epub">
<day>13</day>
<month>2</month>
<year>2008</year>
</pub-date>
<volume>9</volume>
<issue>2</issue>
<fpage>R31</fpage>
<lpage>R31</lpage>
<ext-link ext-link-type="uri" xlink:href="http://genomebiology.com/2008/9/2/R31"></ext-link>
<history>
<date date-type="received">
<day>2</day>
<month>10</month>
<year>2007</year>
</date>
<date date-type="rev-recd">
<day>21</day>
<month>12</month>
<year>2007</year>
</date>
<date date-type="accepted">
<day>13</day>
<month>2</month>
<year>2008</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2008 Aerts et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2008</copyright-year>
<copyright-holder>Aerts et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0">
<p>This is an open access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0"></ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</p>
<pmc-comment> Aerts Stein stein.aerts@med.kuleuven.be Text-mining assisted regulatory annotation 2008Genome Biology 9(2): R31-. (2008)1465-6906(2008)9:2urn:ISSN:1465-6906</pmc-comment>
</license>
</permissions>
<abstract abstract-type="short">
<p>Text-mining technologies can be integrated with genome annotation systems, increasing the availability of annotated
<italic>cis</italic>
-regulatory data.</p>
</abstract>
<abstract>
<sec>
<title>Background</title>
<p>Decoding transcriptional regulatory networks and the genomic
<italic>cis</italic>
-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.</p>
</sec>
<sec>
<title>Results</title>
<p>We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high
<italic>cis</italic>
-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated
<italic>cis</italic>
-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high
<italic>cis</italic>
-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the
<italic>cis</italic>
-regulatory annotation process.</p>
</sec>
<sec>
<title>Conclusion</title>
<p>Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated
<italic>cis</italic>
-regulatory data needed to catalyze advances in the field of gene regulation.</p>
</sec>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>The process of annotation is an essential first step in attributing biological information to genome sequences. Traditionally, the main focus of genome annotation has been the identification and annotation of well-studied biological entities, such as protein-coding genes, RNA genes and repetitive DNA. Efforts to annotate these genomic features typically adopt one of several established annotation paradigms - the 'museum,' 'jamboree,' 'cottage industry,' or 'factory' models of genome annotation (reviewed in [
<xref ref-type="bibr" rid="B1">1</xref>
,
<xref ref-type="bibr" rid="B2">2</xref>
]). Other important functional regions of genomes that are more difficult to predict by
<italic>ab initio </italic>
or homology methods are often omitted from the standard genome annotation process, in particular the
<italic>cis</italic>
-regulatory sequences that control transcription. Instead,
<italic>cis</italic>
-regulatory sequences are typically annotated by manual curation from the literature either under the museum model in the private domain [
<xref ref-type="bibr" rid="B3">3</xref>
] or under a 'boutique' model [
<xref ref-type="bibr" rid="B4">4</xref>
] in the public domain, whereby small teams curate organism- or process-specific datasets from the primary literature for short-term research purposes. Such decentralized resources are disseminated and maintained in
<italic>ad hoc </italic>
ways that are often not integrated with the major genome database resources, and can present a bewildering array of choices to the computational or experimental end-user.</p>
<p>Recently, two efforts have been launched to develop integrated portals for
<italic>cis</italic>
-regulatory annotation - ORegAnno [
<xref ref-type="bibr" rid="B5">5</xref>
] and PAZAR [
<xref ref-type="bibr" rid="B4">4</xref>
] - that aim to support research in
<italic>cis</italic>
-regulatory sequence and network analysis. Both ORegAnno and PAZAR provide principled, standardized technologies for the long-term, community-driven, open-access annotation of
<italic>cis</italic>
-regulatory data in the context of the major genome database resources (for example, National Center for Biotechnology Information (NCBI), Ensembl, University of California Santa Cruz (UCSC)) and, as such, represent a new generation of resources for the annotation of
<italic>cis</italic>
-regulatory data. Despite these advances in infrastructure, many challenges still remain for the comprehensive community-based annotation of
<italic>cis</italic>
-regulatory data. First, as with all decentralized annotation efforts, community annotation of regulatory data from the literature requires systems to track the curation process, including 'triaging' relevant and irrelevant articles and monitoring the curation status of papers. Second, the scale of the
<italic>cis</italic>
-regulatory annotation challenge remains unknown, and thus it is critical to identify and prioritize the set of documents with high
<italic>cis</italic>
-regulatory potential for curation. Third, with curation times currently on the order of approximately one to two hours per paper, a major bottleneck remains in how to efficiently extract
<italic>cis</italic>
-regulatory data from primary text. Recently, rule-based information extraction systems have been developed to extract regulatory relations among pairs of genes and proteins [
<xref ref-type="bibr" rid="B6">6</xref>
-
<xref ref-type="bibr" rid="B8">8</xref>
]; however, many other types of data are necessary for comprehensive
<italic>cis</italic>
-regulatory annotation, such as the organism under investigation and, perhaps most importantly, the sequence and genomic location of
<italic>cis</italic>
-regulatory elements.</p>
<p>We have attempted to solve some of these challenges through the use of text-mining techniques to retrieve and extract relevant documents and data for the annotation of
<italic>cis</italic>
-regulatory networks and sequences. These efforts were inspired by (and conducted in part through) the RegCreative Jamboree [
<xref ref-type="bibr" rid="B9">9</xref>
], a workshop held in late 2006 to explore the interface between the regulatory bioinformatics and text-mining communities. Elsewhere [
<xref ref-type="bibr" rid="B10">10</xref>
], we detail the development of a literature management system for the regulatory annotation community, which warehouses the set of papers that are likely to contain
<italic>cis</italic>
-regulatory data and maintains information on their current curation status. Here we develop a vector space model to identify Medline abstracts of papers that are likely to have high
<italic>cis</italic>
-regulatory content, and use this model to demonstrate that document relevance ranking can assist the annotation of transcriptional regulatory networks and be used to estimate the scale of the regulatory curation challenge. In addition, we show that DNA sequences can be extracted from full-text articles and mapped to genome sequences as a means to identify the location, organism and target gene information that is critical to the
<italic>cis</italic>
-regulatory annotation process. Collectively, our results demonstrate the utility (and the necessity) of employing text-mining approaches to accelerate the community-driven annotation of
<italic>cis</italic>
-regulatory sequences and networks that control transcription.</p>
</sec>
<sec>
<title>Results</title>
<sec>
<title>A literature management system for community annotation and text mining</title>
<p>Assembling the set of documents that are relevant for annotation and tracking the curatorial status of papers are major challenges in community annotation. To help overcome these issues, we have developed a literature management 'queue' for the ORegAnno database, which allows registered users to input papers with known or suspected
<italic>cis</italic>
-regulatory content as targets for curation using their PubMed identifiers (PMIDs). A full description of the ORegAnno Publication Queue and its features is given elsewhere [
<xref ref-type="bibr" rid="B10">10</xref>
]; here, we briefly describe its contents to aid interpretation of our text-mining results. The ORegAnno Publication Queue was initially populated with expert entries obtained from the set of papers in ORegAnno plus existing sources of curated publications, including the
<italic>Drosophila </italic>
DNase I Footprint Database [
<xref ref-type="bibr" rid="B11">11</xref>
], REDfly [
<xref ref-type="bibr" rid="B12">12</xref>
], a catalog of regulatory elements for muscle-specific regulation of transcription [
<xref ref-type="bibr" rid="B13">13</xref>
,
<xref ref-type="bibr" rid="B14">14</xref>
], ABS [
<xref ref-type="bibr" rid="B15">15</xref>
], TRED [
<xref ref-type="bibr" rid="B16">16</xref>
], ooTFD [
<xref ref-type="bibr" rid="B17">17</xref>
] and DBTGR [
<xref ref-type="bibr" rid="B18">18</xref>
]. Additionally, a large number of papers were added manually by individual ORegAnno users from literature searches and review articles. Together, these PMIDs form the 'expert entry' component of the ORegAnno Publication Queue. In the current work, we show how, in addition to offering a powerful literature management system for community annotation, the ORegAnno Publication Queue offers a rich source of PMIDs for assessing information retrieval and information extraction techniques applied to biomedical text in the
<italic>cis</italic>
-regulatory domain.</p>
</sec>
<sec>
<title>A vector space model identifies Medline abstracts with high
<italic>cis</italic>
-regulatory content</title>
<p>As a first step in employing text-mining to aid
<italic>cis</italic>
-regulatory annotation, we attempted to identify a set of full-text papers that could enter the curation process by using information retrieval technology. To do this, we implemented a vector space model [
<xref ref-type="bibr" rid="B19">19</xref>
] that scores the approximately 16 million scientific abstracts from Medline, each represented as a vector of index terms, against a model trained on a corpus of abstracts that
<italic>a priori </italic>
are known to have high
<italic>cis</italic>
-regulatory content. For initial model training, 3,626 abstracts retrieved with the PubMed query 'transcription and regulation and "binding site" and (promoter or enhancer)' (see Materials and methods for details) were first split into two equal parts, forming a training set (
<italic>POS1</italic>
) and a validation set (
<italic>POS2</italic>
).
<italic>POS1 </italic>
contains 3,344 terms after stemming and stop-word removal, representing vocabulary
<italic>VOC1</italic>
. We compared ten different relevancy rankings with
<italic>POS1 </italic>
as query and either the complete
<italic>VOC1 </italic>
or different subsets of
<italic>VOC1 </italic>
as vocabulary. A vocabulary consisting of the 1,000 terms with the highest frequency in the full corpus yielded the highest performance when applied to
<italic>POS2 </italic>
(results not shown). Similar results were obtained using a training set of 6,306 abstracts from papers previously curated in ORegAnno [
<xref ref-type="bibr" rid="B5">5</xref>
], TRANSFAC [
<xref ref-type="bibr" rid="B3">3</xref>
], or FlyReg [
<xref ref-type="bibr" rid="B11">11</xref>
]. Thus, we chose to develop our relevance ranking based on our '
<italic>cis</italic>
-regulatory' PubMed query to avoid biases towards data type, species, or other unknown factors. This approach has the additional advantage that existing sets of curated papers can legitimately be used later as validation sets. To generate the final relevancy ranking of Medline used in further analyses we used a model based on the 1,000 terms (from the 3,626 training abstracts) with the highest corpus frequency as vocabulary. Figure
<xref ref-type="fig" rid="F1">1</xref>
shows the distribution of the final similarity scores for all approximately 16 million abstracts in Medline, with an indication of the top 10,000, top 50,000 and top 100,000 highest scoring abstracts in the distribution (these lists are called top10k, top50k, top100k and so on throughout the following text).</p>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption>
<p>Distribution of cosine similarity scores between the query vector and each of the Medline abstract vectors, indicating the 10,000th (blue diamond), 50,000th (red diamond) and 100,000th (green diamond) ranked abstracts.</p>
</caption>
<graphic xlink:href="gb-2008-9-2-r31-1"></graphic>
</fig>
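<p>The relevance ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the stop-word list, the tokenizer (no stemming), the raw term-count weighting, and the function names are all hypothetical, and term frequency is counted over the training abstracts only.</p>

```python
import math
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given here.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "by"}

def tokenize(text):
    """Lower-case, split on non-letters, drop stop words (no stemming)."""
    words = "".join(c if c.isalpha() else " " for c in text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

def build_vocabulary(training_abstracts, size):
    """Keep the `size` most frequent terms of the training abstracts."""
    counts = Counter()
    for abstract in training_abstracts:
        counts.update(tokenize(abstract))
    return [term for term, _ in counts.most_common(size)]

def vectorize(text, vocabulary):
    """Raw term-count vector over the vocabulary (weighting is an assumption)."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(training_abstracts, candidates, vocab_size=1000):
    """Score each candidate abstract against the pooled training set.

    `candidates` maps a PMID to its abstract text; the result is a list of
    (score, pmid) pairs sorted from most to least similar.
    """
    vocabulary = build_vocabulary(training_abstracts, vocab_size)
    query = vectorize(" ".join(training_abstracts), vocabulary)
    scored = [(cosine(query, vectorize(text, vocabulary)), pmid)
              for pmid, text in candidates.items()]
    return sorted(scored, reverse=True)
```

<p>Because the output is a sorted list of scores rather than a yes/no label, lists such as top10k or top100k are simply prefixes of the ranking.</p>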
<p>Using a similarity-based ranking rather than a classification procedure is particularly useful for our task because it does not require a negative training set, and because a similarity score allows a prioritization of documents for curation rather than a binary decision. To evaluate whether our similarity-based ranking agrees with other information retrieval technologies, we classified all of the approximately 16 million Medline abstracts using a support vector machine (SVM) [
<xref ref-type="bibr" rid="B20">20</xref>
,
<xref ref-type="bibr" rid="B21">21</xref>
] trained on the same set of papers from our initial PubMed query as positives, and an equivalent number of randomly selected Medline abstracts as negatives. Using a radial basis function kernel, we find that 169,402 (1.07%) Medline abstracts are classified as positive and 95.6% of the top100k abstracts identified by our cosine similarity method are called positive by the SVM approach. Cosine similarity values and SVM decision function values are, furthermore, highly correlated (Pearson correlation coefficient is 0.88); 78.4% of abstracts are shared by the top100k when ranked by their cosine or SVM scores. Therefore, the cosine similarity and SVM methods both point to a very large but similar set of abstracts in Medline as having high
<italic>cis</italic>
-regulatory potential.</p>
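<p>The two agreement measures quoted above (the Pearson correlation between scores and the fraction of abstracts shared by two top-k lists) can be computed as in the following sketch. The function names and the dictionary-of-scores input format are assumptions made for illustration; the scores themselves would come from the cosine and SVM models.</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def top_k(scores, k):
    """IDs of the k highest-scoring documents in a {doc_id: score} mapping."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def top_k_overlap(scores_a, scores_b, k):
    """Fraction of documents shared by the two top-k lists."""
    shared = top_k(scores_a, k).intersection(top_k(scores_b, k))
    return len(shared) / k
```

<p>With k = 100,000 and the two score sets for all of Medline, the second function would yield the kind of shared-fraction figure reported in the text.</p>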
<p>The coverage of several validation sets within the final ranking is shown in Table
<xref ref-type="table" rid="T1">1</xref>
. Before calculating the sensitivity (recall) for each validation set, we removed all Medline abstracts from these sets that were also part of the training set. As a first validation set we used TRANSFAC [
<xref ref-type="bibr" rid="B3">3</xref>
], a commercial database of manually curated transcription factor binding sites (TFBSs). We collected all 5,719 PMIDs from TRANSFAC (v10.4) that are linked to a curated TFBS. Of the set of 5,183 independent TRANSFAC PMIDs (536 were part of the training set), 75.4% are found within the top50k and 88.2% within the top100k abstracts. This shows that our model is able to generalize and recover many true positive abstracts with high
<italic>cis</italic>
-regulatory content. In fact, the vector space model realizes an increase in the proportion of TRANSFAC PMIDs from 14.7% in the 3,626 papers based on the initial PubMed query to 18.8% in the top 3,626 publications after relevancy ranking. Likewise, using a second validation set of 186 independent positive PMIDs from the FlyReg database of curated TFBSs in
<italic>Drosophila</italic>
, we find high sensitivities of 78.5% and 89.2% of FlyReg PMIDs in the top50k and top100k scoring abstracts in Medline, respectively.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption>
<p>Coverage of validation sets (excluding PMIDs in the training set) within the top10k, top50k, and top100k ranked abstracts for the vector space model relevancy ranking</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td></td>
<td align="center">TRANSFAC</td>
<td align="center">FlyReg</td>
<td align="center">ORegAnno Queue</td>
<td align="center">ORegAnno prior to RegCreative</td>
<td align="center">RegCreative success</td>
<td align="center">RegCreative failure</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Number of PMIDs</td>
<td align="center">5,719</td>
<td align="center">200</td>
<td align="center">4,145</td>
<td align="center">376</td>
<td align="center">260</td>
<td align="center">218</td>
</tr>
<tr>
<td align="left">Number of PMIDs (no training data)</td>
<td align="center">5,183</td>
<td align="center">186</td>
<td align="center">3,687</td>
<td align="center">340</td>
<td align="center">228</td>
<td align="center">212</td>
</tr>
<tr>
<td align="left">Number in top10k</td>
<td align="center">1,390</td>
<td align="center">38</td>
<td align="center">1,035</td>
<td align="center">89</td>
<td align="center">59</td>
<td align="center">18</td>
</tr>
<tr>
<td align="left">Percent in top10k</td>
<td align="center">26.8%</td>
<td align="center">20.4%</td>
<td align="center">28.1%</td>
<td align="center">26.2%</td>
<td align="center">25.9%</td>
<td align="center">8.5%</td>
</tr>
<tr>
<td align="left">Number in top50k</td>
<td align="center">3,908</td>
<td align="center">146</td>
<td align="center">2,753</td>
<td align="center">260</td>
<td align="center">165</td>
<td align="center">79</td>
</tr>
<tr>
<td align="left">Percent in top50k</td>
<td align="center">75.4%</td>
<td align="center">78.5%</td>
<td align="center">74.7%</td>
<td align="center">76.5%</td>
<td align="center">72.4%</td>
<td align="center">37.3%</td>
</tr>
<tr>
<td align="left">Number in top100k</td>
<td align="center">4,572</td>
<td align="center">166</td>
<td align="center">3,208</td>
<td align="center">301</td>
<td align="center">199</td>
<td align="center">110</td>
</tr>
<tr>
<td align="left">Percent in top100k</td>
<td align="center">88.2%</td>
<td align="center">89.2%</td>
<td align="center">87.0%</td>
<td align="center">88.5%</td>
<td align="center">87.3%</td>
<td align="center">51.9%</td>
</tr>
</tbody>
</table>
</table-wrap>
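<p>The percentages in Table 1 are recall values computed after removing training PMIDs from each validation set. A minimal sketch of that calculation follows; the function name and argument layout are hypothetical, and the two final lines only re-derive two table entries from the counts given in the table.</p>

```python
def coverage(validation_pmids, training_pmids, ranked_pmids, k):
    """Recall of a validation set within the top-k of a ranking.

    PMIDs that also occur in the training set are excluded first.
    Returns (hits, number of independent PMIDs, percent covered).
    """
    independent = set(validation_pmids) - set(training_pmids)
    top = set(ranked_pmids[:k])
    hits = len(independent.intersection(top))
    percent = 100.0 * hits / len(independent) if independent else 0.0
    return hits, len(independent), percent

# Re-deriving two Table 1 entries from their counts:
transfac_top10k = round(100 * 1390 / 5183, 1)   # TRANSFAC, top10k: 26.8
failure_top100k = round(100 * 110 / 212, 1)     # RegCreative failure, top100k: 51.9
```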
<p>Next, we investigated the coverage of true positive abstracts using curated papers from the ORegAnno database [
<xref ref-type="bibr" rid="B5">5</xref>
], including those curated as a part of the RegCreative Jamboree [
<xref ref-type="bibr" rid="B9">9</xref>
]. Prior to the Publication Queue, ORegAnno contained 376 curated papers, of which 340 are not part of the training set in the vector space model. Of these, 88.5% (n = 301) are covered in the top100k. Since the creation of the Publication Queue, curated papers are flagged with 'failure' or 'success,' depending on whether they contained enough data to allow the creation of a full ORegAnno record (that is, either a regulatory region or a TFBS with all required fields; see above). Surprisingly, in a set of 478 papers from the ORegAnno Publication Queue (see above) that were known
<italic>a priori </italic>
to have a high likelihood of containing curatable
<italic>cis</italic>
-regulatory data, only 54.4% (n = 260) were confirmed as 'success' papers during the RegCreative Jamboree. The remaining 218 'failure' papers contained either no regulatory data, or one or more critical data fields were missing (for example, the regulatory sequence could not be identified or unambiguously mapped to a target gene or species). Excluding training abstracts, 87.3% (n = 199) of the success papers are found in the top100k but only 51.9% (n = 110) of the failure papers are found in the top100k, indicating that our relevance ranking increases the likelihood that a paper has curatable
<italic>cis</italic>
-regulatory data. Collectively, these experiments show that our vector space model successfully identifies and ranks papers with enriched
<italic>cis</italic>
-regulatory content based on Medline abstracts, and that information retrieval techniques can be used to populate a larger ORegAnno Publication Queue to assist the community annotation of
<italic>cis</italic>
-regulatory data.</p>
</sec>
<sec>
<title>Estimating the size of the
<italic>cis</italic>
-regulatory corpus</title>
<p>Although the sensitivities of our vector space model on evaluation sets are high, the calculations were performed on large sets of PMIDs (10k, 50k or 100k), meaning that the majority of candidate papers do not fall into any of the existing sets of curated papers. To investigate the degree to which the additional predictions show high true positive rates, we conducted a validation experiment that also gives us an indication of the scale of the
<italic>cis</italic>
-regulatory annotation challenge. We constructed a sample of 200 PMIDs evenly spaced every 500 abstracts across the top100k abstracts. Full-text papers for these 200 samples were subjected to a 'pseudo-curation' procedure in which each paper was read by an expert and, instead of being fully curated, was only scored with respect to its 'curatability' for containing a TFBS (see Materials and methods). This experiment allowed us to estimate how the proportions of true positives and false positives vary as a function of position in the ranked list of the top100k scoring Medline abstracts. Figure
<xref ref-type="fig" rid="F2">2</xref>
shows the positive predictive value (PPV) for each threshold of the top100k. The first 10 samples were all success papers, indicating that the top scoring 4,501 papers are extremely likely to contain curatable
<italic>cis</italic>
-regulatory data. From then onwards, the PPV starts to decrease but still remains above 30% for the entire top100k scoring abstracts. This curve can be used to determine an optimal threshold for including papers in the ranked Medline list into the ORegAnno Publication Queue. As noted above, the proportion of success papers from the expert-entry ORegAnno Publication Queue was 54.4% during the RegCreative Jamboree. To achieve a similar curation success rate in the set of papers identified by the vector space model (namely PPV approximately 50%), we would include the top 58,000 scoring abstracts. Therefore, we estimate that the scale of the full corpus with curatable
<italic>cis</italic>
-regulatory data in Medline is on the order of approximately 30,000 papers. We note that this is a conservative measure because the success criteria are strict. Indeed, among the failure papers are many that contain regulatory data or references to other potential success papers (Figure
<xref ref-type="fig" rid="F3">3</xref>
). Based on these results, we added PMIDs and ranks for the top 58,000 scoring papers in Medline as 'text-mining entries' to the ORegAnno Publication Queue.</p>
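<p>The threshold selection described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure used to produce Figure 2: the function names, the cumulative definition of PPV over the sampled ranks, and the toy labels are all hypothetical.</p>
<preformat><![CDATA[
```python
from typing import List, Optional, Tuple


def cumulative_ppv(samples: List[Tuple[int, bool]]) -> List[Tuple[int, float]]:
    """Return (rank, PPV) pairs, where the PPV at a rank is the fraction of
    pseudo-curated samples at or above that rank scored as 'success'."""
    curve = []
    successes = 0
    for seen, (rank, is_success) in enumerate(sorted(samples), start=1):
        successes += int(is_success)
        curve.append((rank, successes / seen))
    return curve


def deepest_rank_with_ppv(curve, target: float) -> Optional[int]:
    """Largest sampled rank whose cumulative PPV is still at least `target`."""
    passing = [rank for rank, ppv in curve if ppv >= target]
    return max(passing) if passing else None


def estimated_corpus_size(threshold_rank: int, ppv: float) -> int:
    """Expected number of curatable papers above the threshold."""
    return round(threshold_rank * ppv)


# Toy labels (hypothetical): one sample every 500 ranks, as in the sampling design.
toy = [(1, True), (501, True), (1001, False),
       (1501, True), (2001, False), (2501, False)]
curve = cumulative_ppv(toy)
```
]]></preformat>
<p>Under these assumptions, a threshold of 58,000 abstracts at a PPV of 0.5 gives an estimate of 29,000 papers, which is consistent with the figure of approximately 30,000 quoted above.</p>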
<fig position="float" id="F2">
<label>Figure 2</label>
<caption>
<p>PPV calculated for each threshold in the top100k of the final relevancy ranking, using the pseudo-curation results of 200 evenly distributed samples. The length of the final 'text-mining entry' component of the ORegAnno Publication Queue was chosen at 58,000, which yields a PPV of 50%.</p>
</caption>
<graphic xlink:href="gb-2008-9-2-r31-2"></graphic>
</fig>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption>
<p>Results of the pseudo-curation procedure on 200 evenly distributed samples across the top100k.</p>
</caption>
<graphic xlink:href="gb-2008-9-2-r31-3"></graphic>
</fig>
</sec>
<sec>
<title>Abstract relevance ranking aids the construction of regulatory networks</title>
<p>To illustrate the utility of identifying papers with high
<italic>cis</italic>
-regulatory content, we queried the top58k scoring abstracts for a particular transcription factor (TF), namely the
<italic>Drosophila </italic>
homeodomain-containing gene
<italic>even</italic>
-
<italic>skipped </italic>
(
<italic>eve</italic>
). Our goal was to use the set of papers enriched for
<italic>cis</italic>
-regulatory content to construct a literature-based transcriptional regulatory network focused on the upstream regulating factors and downstream target genes (TGs) of
<italic>eve</italic>
, based on high-quality published TFBS data. For this experiment we started with the entire list of 664 references associated with
<italic>eve </italic>
in FlyBase [
<xref ref-type="bibr" rid="B22">22</xref>
], which also includes papers not related to
<italic>cis</italic>
-regulatory data (for example, genetic interactions). We cross-referenced this list of all papers on
<italic>eve </italic>
with the top58k list to filter for papers on
<italic>eve </italic>
that are likely to contain
<italic>cis</italic>
-regulatory data. Of the 664
<italic>eve </italic>
papers, 88 are found in the top58k list (147 are in the top100k), and for 85 of those (144 for the top100k) we retrieved the full PDF paper. We conducted a pseudo-curation analysis on these 85 papers to identify those that reported binary TF→TG relationships. We classified 35 out of these 85 candidates as 'success' papers, which revealed 43 unique binary TF→TG relationships (there were 47 relationships in total, including 4 relationships that occurred twice), 20 of which involved
<italic>eve </italic>
either as TF or as TG. A summary of the identified regulatory interactions is presented in Figure
<xref ref-type="fig" rid="F4">4</xref>
as a network constructed using Cytoscape [
<xref ref-type="bibr" rid="B23">23</xref>
]. By comparison with previously curated binary TF→TG relationships for
<italic>eve </italic>
in the FlyReg database [
<xref ref-type="bibr" rid="B11">11</xref>
], our automated document retrieval process recovered 100% (12 of 12) of known upstream activating TFs, and 86% (6 of 7) of known downstream TGs. The only downstream TG curated in FlyReg that was missing in this analysis was
<italic>Abdominal-A </italic>
(
<italic>Abd-A</italic>
), which was omitted because it was not present in the original list of
<italic>eve</italic>
-related papers curated by FlyBase. These results show that cross-referencing general PMID lists for a given gene against our vector space model can enrich for papers that report direct
<italic>cis</italic>
-regulatory interactions for that gene, that transcriptional regulatory networks can be assembled from text-extracted binary TF→TG relationships [
<xref ref-type="bibr" rid="B6">6</xref>
-
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
], and that TF→TG interactions may be extracted from text even when full curation of
<italic>cis</italic>
-regulatory sequences may not be possible.</p>
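<p>Assembling a network from binary TF→TG relationships, with repeated reports collapsed into unique edges, can be illustrated as follows. This is a sketch only: the function names are hypothetical and the placeholder gene names (tfA, tfB, tgX, tgY) are not taken from the pseudo-curation results.</p>
<preformat><![CDATA[
```python
from collections import Counter, defaultdict


def build_network(relationships):
    """Collapse repeated (TF, TG) pairs into unique directed edges and keep a
    count of how many times each pair was reported."""
    support = Counter(relationships)
    targets = defaultdict(set)
    for tf, tg in support:
        targets[tf].add(tg)
    return support, dict(targets)


def edges_involving(support, gene):
    """Unique edges in which `gene` appears either as TF or as TG."""
    return sorted(edge for edge in support if gene in edge)


# Placeholder relationships, for illustration only.
pairs = [("tfA", "eve"), ("tfB", "eve"), ("tfA", "eve"),
         ("eve", "tgX"), ("tfB", "tgY")]
support, targets = build_network(pairs)
```
]]></preformat>
<p>In this toy input, five reported relationships collapse to four unique edges, three of which involve the query gene, mirroring the distinction drawn above between total and unique relationships.</p>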
<fig position="float" id="F4">
<label>Figure 4</label>
<caption>
<p>Transcriptional regulatory sub-network around the
<italic>Drosophila </italic>
transcription factor
<italic>even-skipped </italic>
(
<italic>eve</italic>
). All nodes and edges were retrieved from
<italic>eve</italic>
-related publications in the top100k abstract list. Black edges are success papers (that is, fully curatable publications); grey edges are failure papers that report regulatory data (for example, consensus sites) but are not the primary reference; grey dashed edges are failure papers that contain regulatory data that are not complete enough to allow full curation; blue edges are failures that report protein-protein interactions.</p>
</caption>
<graphic xlink:href="gb-2008-9-2-r31-4"></graphic>
</fig>
</sec>
<sec>
<title>Full-text articles contain
<italic>cis</italic>
-regulatory sequences that can be automatically mapped to genomes</title>
<p>We also evaluated the possibility of automatically annotating
<italic>cis</italic>
-regulatory sequences from publications with high
<italic>cis</italic>
-regulatory content by extracting DNA-like strings from text and mapping these putative DNA sequences to genomes. Previously, it has been shown that short protein and nucleic acid sequence strings can be extracted from text with high precision, and that many extracted DNA sequences correspond to regulatory sequences or motifs [
<xref ref-type="bibr" rid="B25">25</xref>
]. Using automated downloads of full-text articles based on the NCBI eutils, followed by HTML-scanning for links that end with 'pdf,' we obtained PDFs for 86.9% (n = 9,940) of 11,437 papers with high
<italic>cis</italic>
-regulatory content. This recovery rate of PDFs from PMID lists is slightly higher than a rate of 79.6% reported for papers on bacterial gene regulation [
<xref ref-type="bibr" rid="B8">8</xref>
]. We converted 95.0% (9,440/9,940) of full-text PDFs into plain text files larger than 2,000 bytes, a cutoff that represented the lower size limit of converted files with
<italic>cis</italic>
-regulatory content based on manual inspection. We extracted DNA-like strings from 85.4% (8,066/9,440) of these text files using a rule-based approach involving regular expressions and word size cutoffs (see Materials and methods). In total, we obtained nearly 2.8 Mb of DNA-like text from these 8,066 papers. For 36.9% (2,975/8,066) of the PMIDs with extractable fasta sequence, at least one extracted DNA sequence gave a BLAST hit with an E-value of 10e-5 or better to at least one of the five genomes under investigation. Numbers of documents obtained at each stage of the process for the different source PMID lists are shown in Table
<xref ref-type="table" rid="T2">2</xref>
. Overall, the proportion of papers with sequences that can be mapped to one of the five genomes is 26.0% (2,975/11,437), with the lowest efficiency step being the mapping of short sequence elements to genomes. Similar results were obtained using a previously reported Markov chain method [
<xref ref-type="bibr" rid="B25">25</xref>
] to extract DNA sequences from full-text (data not shown), with differences mainly attributable to the inclusion of lowercase DNA characters by the method of Wren
<italic>et al. </italic>
[
<xref ref-type="bibr" rid="B25">25</xref>
].</p>
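<p>A rule-based extraction of DNA-like strings with a word size cutoff might look like the following. This is a minimal sketch and not the rule set described in Materials and methods: the regular expression, the restriction to uppercase A, C, G and T, the cutoff of 12 nucleotides and the fasta header format are all assumptions made for illustration.</p>
<preformat><![CDATA[
```python
import re

MIN_LENGTH = 12  # hypothetical word-size cutoff, not the value used in the paper

# Runs of uppercase nucleotide letters, optionally broken by single spaces or
# hyphens, as often happens when sequences are typeset in blocks.
DNA_LIKE = re.compile(r"\b[ACGT]+(?:[ \-][ACGT]+)*\b")


def extract_dna_like(text, min_length=MIN_LENGTH):
    """Return DNA-like strings containing at least `min_length` nucleotides."""
    found = []
    for match in DNA_LIKE.finditer(text):
        sequence = re.sub(r"[ \-]", "", match.group())
        if len(sequence) >= min_length:
            found.append(sequence)
    return found


def to_fasta(pmid, sequences):
    """Write sequences as fasta records named after the source PMID."""
    return "".join(f">{pmid}_{i}\n{seq}\n"
                   for i, seq in enumerate(sequences, start=1))
```
]]></preformat>
<p>The length cutoff is what discards short uppercase tokens such as 'TATA' that occur in ordinary prose; a lower cutoff recovers more strings at the cost of more false matches.</p>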
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption>
<p>Efficiency of document recovery, sequence extraction and genome mapping for the source lists of PMIDs with high
<italic>cis</italic>
-regulatory content</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td></td>
<td align="center">TRANSFAC</td>
<td align="center">FlyReg</td>
<td align="center">ORegAnno</td>
<td align="center">Queue</td>
<td align="center">top4,501</td>
<td align="center">All</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Number of PMIDs</td>
<td align="center">5,719</td>
<td align="center">202</td>
<td align="center">914</td>
<td align="center">4,145</td>
<td align="center">4,491</td>
<td align="center">11,437</td>
</tr>
<tr>
<td align="left">Number of PMIDs with PDF</td>
<td align="center">5,302</td>
<td align="center">187</td>
<td align="center">835</td>
<td align="center">3,710</td>
<td align="center">3,677</td>
<td align="center">9,940</td>
</tr>
<tr>
<td align="left">Percent PMIDs with PDF</td>
<td align="center">92.7%</td>
<td align="center">92.6%</td>
<td align="center">91.4%</td>
<td align="center">89.5%</td>
<td align="center">81.9%</td>
<td align="center">86.9%</td>
</tr>
<tr>
<td align="left">Number of PMIDs with text >2 Kbytes</td>
<td align="center">5,051</td>
<td align="center">175</td>
<td align="center">793</td>
<td align="center">3,517</td>
<td align="center">3,498</td>
<td align="center">9,440</td>
</tr>
<tr>
<td align="left">Percent PMIDs with text >2 Kbytes</td>
<td align="center">88.3%</td>
<td align="center">86.6%</td>
<td align="center">86.8%</td>
<td align="center">84.8%</td>
<td align="center">77.9%</td>
<td align="center">82.5%</td>
</tr>
<tr>
<td align="left">Efficiency of text conversion</td>
<td align="center">95.3%</td>
<td align="center">93.6%</td>
<td align="center">95.0%</td>
<td align="center">94.8%</td>
<td align="center">95.1%</td>
<td align="center">95.0%</td>
</tr>
<tr>
<td align="left">Number of PMIDs with fasta sequence</td>
<td align="center">4,357</td>
<td align="center">155</td>
<td align="center">660</td>
<td align="center">3,044</td>
<td align="center">3,080</td>
<td align="center">8,066</td>
</tr>
<tr>
<td align="left">Percent PMIDs with fasta sequence</td>
<td align="center">76.2%</td>
<td align="center">76.7%</td>
<td align="center">72.2%</td>
<td align="center">73.4%</td>
<td align="center">68.6%</td>
<td align="center">70.5%</td>
</tr>
<tr>
<td align="left">Efficiency of sequence extraction</td>
<td align="center">86.3%</td>
<td align="center">88.6%</td>
<td align="center">83.2%</td>
<td align="center">86.6%</td>
<td align="center">88.1%</td>
<td align="center">85.4%</td>
</tr>
<tr>
<td align="left">Number of PMIDs with fasta sequence mapped to genome</td>
<td align="center">1,518</td>
<td align="center">75</td>
<td align="center">303</td>
<td align="center">1,279</td>
<td align="center">1,260</td>
<td align="center">2,975</td>
</tr>
<tr>
<td align="left">Percent PMIDs with fasta sequence mapped to genome</td>
<td align="center">26.5%</td>
<td align="center">37.1%</td>
<td align="center">33.2%</td>
<td align="center">30.9%</td>
<td align="center">28.1%</td>
<td align="center">26.0%</td>
</tr>
<tr>
<td align="left">Efficiency of genome mapping</td>
<td align="center">34.8%</td>
<td align="center">48.4%</td>
<td align="center">45.9%</td>
<td align="center">42.0%</td>
<td align="center">40.9%</td>
<td align="center">36.9%</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Note that totals are less than the sum of the sets since many PMIDs are found in more than one source list.</p>
</table-wrap-foot>
</table-wrap>
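<p>The two kinds of percentage reported in Table 2 (relative to the starting PMID list, and relative to the preceding stage, termed 'efficiency') can be recomputed from the counts. The sketch below uses the counts from the 'All' column; the function name and the stage labels are hypothetical.</p>
<preformat><![CDATA[
```python
def funnel(counts):
    """For each stage after the first, report the percentage relative to the
    starting list and relative to the preceding stage (the 'efficiency')."""
    names = list(counts)
    start = counts[names[0]]
    rows = []
    for previous, current in zip(names, names[1:]):
        rows.append((
            current,
            round(100 * counts[current] / start, 1),
            round(100 * counts[current] / counts[previous], 1),
        ))
    return rows


# Counts from the 'All' column of Table 2.
all_column = {
    "PMIDs": 11437,
    "with PDF": 9940,
    "with text >2 Kbytes": 9440,
    "with fasta sequence": 8066,
    "mapped to genome": 2975,
}

for stage, percent_of_start, efficiency in funnel(all_column):
    print(f"{stage}: {percent_of_start}% of PMIDs, {efficiency}% of previous stage")
```
]]></preformat>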
<p>To provide biologically meaningful
<italic>cis</italic>
-regulatory annotations, automatic text-based sequence extraction must identify genomic regions that match true
<italic>cis</italic>
-regulatory elements but not a large number of other irrelevant features. To test this we used a set of 3,208 regulatory elements with known genomic location from a list of 850 'evaluation' papers with manually curated entries in ORegAnno. Three papers (PMIDs 12566409 [
<xref ref-type="bibr" rid="B26">26</xref>
], 17086198 [
<xref ref-type="bibr" rid="B27">27</xref>
] and 17558387 [
<xref ref-type="bibr" rid="B28">28</xref>
]) with 947 ORegAnno records from high-throughput experiments in humans that were imported in bulk into ORegAnno were omitted from this analysis. The numbers of regulatory elements annotated in ORegAnno, regions mapped with extracted text, and their overlap are shown in Table
<xref ref-type="table" rid="T3">3</xref>
. Overall, the PPV of our approach is reasonably high (64.8%), typically with lower PPV in large mammalian genomes (42.2-70.6%) and higher PPV in small invertebrate genomes (79.3-81.3%). At the
<italic>cis</italic>
-regulatory element level, sequences overlapping approximately 33% of known ORegAnno annotations overall can be obtained directly from primary text and mapped to genomes. For
<italic>Drosophila melanogaster</italic>
, we find that text-based regulatory sequence extraction can yield annotations that have a higher PPV but lower sensitivity than the best
<italic>de novo </italic>
regulatory element prediction methods [
<xref ref-type="bibr" rid="B29">29</xref>
]. Higher sensitivities for text-based regulatory sequence prediction are observed in mouse and rat (58.4-59.8%) relative to human, worms and flies (12.4-32.8%), which can be explained by the fact that these latter species have been the subject of dedicated annotation efforts in ORegAnno and are likely to contain a deeper level of human inference in their annotation. Since only 54.4% of papers were deemed 'success' papers in the RegCreative Jamboree (see above), these relatively low sensitivities are perhaps not surprising and indicate that, in some species, we may be achieving sensitivities approaching the upper bound of what is possible automatically. An example of the accuracy and utility of text-based regulatory sequence extraction is shown in Figure
<xref ref-type="fig" rid="F5">5</xref>
. The
<italic>Hsp70 </italic>
promoter region is duplicated six times in the
<italic>D. melanogaster </italic>
genome, with only one locus currently annotated in FlyReg (
<italic>Hsp70Ab</italic>
). Our method cleanly extracts and correctly maps several
<italic>Hsp70 </italic>
regulatory elements from full text to genome coordinates, both from previously annotated ('evaluation') papers and from other ('prediction') papers not currently annotated in ORegAnno (Figure
<xref ref-type="fig" rid="F5">5a</xref>
). In addition, the unbiased nature of our method improves the current annotation of
<italic>Hsp70 </italic>
regulatory sequences in
<italic>Drosophila</italic>
, with text hits mapping to all six copies of the
<italic>Hsp70 </italic>
gene as well as the promoter region of the
<italic>α-γ-element </italic>
noncoding RNA gene that is expressed in response to heat shock [
<xref ref-type="bibr" rid="B30">30</xref>
,
<xref ref-type="bibr" rid="B31">31</xref>
] (Figure
<xref ref-type="fig" rid="F5">5a,b</xref>
).</p>
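<p>The PPV and sensitivity used in this evaluation are both based on coordinate overlap between text hits and curated annotations. A minimal sketch of that calculation follows; the half-open interval convention, the function names and the toy coordinates are assumptions, not details taken from the analysis itself.</p>
<preformat><![CDATA[
```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def ppv_and_sensitivity(text_hits, annotations):
    """PPV: fraction of text hits that overlap any annotation.
    Sensitivity: fraction of annotations overlapped by any text hit."""
    hits_overlapping = sum(
        any(overlaps(hit, ann) for ann in annotations) for hit in text_hits)
    annotations_overlapped = sum(
        any(overlaps(ann, hit) for hit in text_hits) for ann in annotations)
    return (hits_overlapping / len(text_hits),
            annotations_overlapped / len(annotations))


# Hypothetical coordinates, for illustration only.
hits = [("chr2L", 100, 150), ("chr2L", 900, 950)]
known = [("chr2L", 120, 130), ("chr2L", 140, 200), ("chr3R", 100, 150)]
ppv, sensitivity = ppv_and_sensitivity(hits, known)
```
]]></preformat>
<p>Because a single text hit can overlap several annotated elements, the number of annotations overlapped can exceed the number of overlapping hits, as it does in Table 3.</p>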
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption>
<p>Performance of text-based sequence extraction for
<italic>cis</italic>
-regulatory annotation</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<td></td>
<td align="center">dm2</td>
<td align="center">hg18</td>
<td align="center">mm8</td>
<td align="center">ce2</td>
<td align="center">rn4</td>
<td align="center">All</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Number of ORegAnno annotations</td>
<td align="center">2,079</td>
<td align="center">589</td>
<td align="center">255</td>
<td align="center">178</td>
<td align="center">107</td>
<td align="center">3,208</td>
</tr>
<tr>
<td align="left">Number of PMIDs with ORegAnno annotation</td>
<td align="center">389</td>
<td align="center">283</td>
<td align="center">113</td>
<td align="center">30</td>
<td align="center">48</td>
<td align="center">850</td>
</tr>
<tr>
<td align="left">Number of PMIDs with Ensembl target gene name(s)</td>
<td align="center">388</td>
<td align="center">253</td>
<td align="center">107</td>
<td align="center">29</td>
<td align="center">42</td>
<td align="center">819</td>
</tr>
<tr>
<td align="left">Number of text hits from PMIDs with ORegAnno annotation</td>
<td align="center">188</td>
<td align="center">128</td>
<td align="center">51</td>
<td align="center">16</td>
<td align="center">32</td>
<td align="center">415</td>
</tr>
<tr>
<td align="left">Number of text hits that overlap ORegAnno annotation</td>
<td align="center">149</td>
<td align="center">54</td>
<td align="center">36</td>
<td align="center">13</td>
<td align="center">17</td>
<td align="center">269</td>
</tr>
<tr>
<td align="left">Percent text hits that overlap ORegAnno annotation (PPV)</td>
<td align="center">79.3%</td>
<td align="center">42.2%</td>
<td align="center">70.6%</td>
<td align="center">81.3%</td>
<td align="center">53.1%</td>
<td align="center">64.8%</td>
</tr>
<tr>
<td align="left">Number of ORegAnno annotations overlapped by a text hit</td>
<td align="center">681</td>
<td align="center">133</td>
<td align="center">149</td>
<td align="center">22</td>
<td align="center">64</td>
<td align="center">1,049</td>
</tr>
<tr>
<td align="left">Percent ORegAnno annotations overlapped by a text hit (SN)</td>
<td align="center">32.8%</td>
<td align="center">22.6%</td>
<td align="center">58.4%</td>
<td align="center">12.4%</td>
<td align="center">59.8%</td>
<td align="center">32.7%</td>
</tr>
<tr>
<td align="left">Number of PMIDs with text hits</td>
<td align="center">124</td>
<td align="center">91</td>
<td align="center">44</td>
<td align="center">12</td>
<td align="center">24</td>
<td align="center">295</td>
</tr>
<tr>
<td align="left">Percent PMIDs with text hits (coverage)</td>
<td align="center">31.9%</td>
<td align="center">32.2%</td>
<td align="center">38.9%</td>
<td align="center">40.0%</td>
<td align="center">50.0%</td>
<td align="center">34.7%</td>
</tr>
<tr>
<td align="left">Number of PMIDs with text hits to correct species</td>
<td align="center">123</td>
<td align="center">84</td>
<td align="center">37</td>
<td align="center">12</td>
<td align="center">18</td>
<td align="center">274</td>
</tr>
<tr>
<td align="left">Percent PMIDs with text hits to correct species (PPV)</td>
<td align="center">99.2%</td>
<td align="center">92.3%</td>
<td align="center">84.1%</td>
<td align="center">100.0%</td>
<td align="center">75.0%</td>
<td align="center">92.9%</td>
</tr>
<tr>
<td align="left">Number of PMIDs with text hits and Ensembl target gene name(s)</td>
<td align="center">122</td>
<td align="center">77</td>
<td align="center">33</td>
<td align="center">11</td>
<td align="center">16</td>
<td align="center">259</td>
</tr>
<tr>
<td align="left">Number of PMIDs with text hits and perfect match to correct target gene name(s)</td>
<td align="center">67</td>
<td align="center">57</td>
<td align="center">24</td>
<td align="center">4</td>
<td align="center">10</td>
<td align="center">162</td>
</tr>
<tr>
<td align="left">Number of PMIDs with text hits and partial match to correct target gene name(s)</td>
<td align="center">16</td>
<td align="center">12</td>
<td align="center">5</td>
<td align="center">3</td>
<td align="center">4</td>
<td align="center">40</td>
</tr>
<tr>
<td align="left">Percent PMIDs with text hits and match to correct target gene name (PPV)</td>
<td align="center">68.0%</td>
<td align="center">89.6%</td>
<td align="center">87.9%</td>
<td align="center">63.6%</td>
<td align="center">87.5%</td>
<td align="center">78.0%</td>
</tr>
<tr>
<td align="left">Number of PMIDs without ORegAnno annotation with text hits</td>
<td align="center">76</td>
<td align="center">1,291</td>
<td align="center">841</td>
<td align="center">13</td>
<td align="center">459</td>
<td align="center">2,680</td>
</tr>
<tr>
<td align="left">Number of text hits from PMIDs without ORegAnno annotation</td>
<td align="center">126</td>
<td align="center">2,602</td>
<td align="center">2,131</td>
<td align="center">14</td>
<td align="center">1,002</td>
<td align="center">5,875</td>
</tr>
<tr>
<td align="left">Number of text hits from PMIDs without ORegAnno annotation that overlap ORegAnno annotation</td>
<td align="center">59</td>
<td align="center">202</td>
<td align="center">58</td>
<td align="center">1</td>
<td align="center">18</td>
<td align="center">338</td>
</tr>
<tr>
<td align="left">Number of ORegAnno annotations overlapped by text hits from PMIDs without ORegAnno annotation</td>
<td align="center">200</td>
<td align="center">347</td>
<td align="center">139</td>
<td align="center">3</td>
<td align="center">33</td>
<td align="center">722</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig position="float" id="F5">
<label>Figure 5</label>
<caption>
<p>Comparison of automatically extracted text-based annotation and manual annotation of the
<italic>D. melanogaster Hsp70 </italic>
gene regions.
<bold>(a) </bold>
The
<italic>Hsp70Aa-Ab </italic>
region.
<bold>(b) </bold>
The
<italic>Hsp70Ba-Bc </italic>
region. The 'evaluation' track refers to text-based hits extracted from papers with curated regulatory data in ORegAnno; the 'prediction' track refers to text hits extracted from papers not currently curated in ORegAnno, but with high predicted
<italic>cis</italic>
-regulatory content. Annotations in both text-based tracks are labeled with their corresponding PMIDs. Also shown are the original manual annotation in the FlyReg database, the automated mapping of these curated data in ORegAnno, and FlyBase genes, including the
<italic>α-γ-element </italic>
noncoding RNA gene that is expressed in response to heat shock. Differences in the FlyReg and ORegAnno mappings in (a) arise because the sequences for these regions are duplicated in the genome and alternative unique mappings are chosen in the two databases.</p>
</caption>
<graphic xlink:href="gb-2008-9-2-r31-5"></graphic>
</fig>
</sec>
<sec>
<title>DNA sequences extracted from text identify organisms and target genes</title>
<p>The organism referred to in a paper critically affects systems that attempt to recognize gene names in biomedical text and cross-reference them to external database identifiers [
<xref ref-type="bibr" rid="B32">32</xref>
]. Species identifiers are also a mandatory field in the ORegAnno curation process. Thus, we investigated whether our sequence extraction and genome mapping process could provide a novel solution to the species identification problem in text mining. Of the 850 unique PMIDs with ORegAnno annotations in one or more of the five species studied here (11 PMIDs have ORegAnno records for 2 species, and 1 PMID has ORegAnno records for 3 species), 295 had best genome hits obtained from extracted sequences. The correct species was identified using the genome with the highest-scoring BLAST hit for 92.9% (274/295) of PMIDs with hits extracted from text and ORegAnno annotations. We manually inspected the best genome hits that were assigned to the wrong species and found that the vast majority were for hits among the three closely related mammalian species studied here (rat, mouse and human). Most of these incorrect assignments result from the requirement of a single best genome match, which can cause the wrong species identification for two reasons: first, a single PMID may report sequences (and therefore have ORegAnno records) for multiple species but only a single species gets chosen; second, only a single species is reported in the paper and annotated in ORegAnno, but the wrong species is assigned because the sequences (and BLAST scores) in another species are identical. In addition, a small number of 'incorrect' species assignments arise because the species was itself incorrectly curated in the current ORegAnno annotation (for example, OREG0000115). These incorrect annotations have been deprecated and replaced by correct annotations in ORegAnno (for example, OREG0004685). These results demonstrate that primary text contains valuable information about the species under investigation encoded in extractable DNA sequences, but that mistaken species assignments may occur among closely related species or when sequences from multiple species are reported in a single paper.</p>
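<p>Species assignment by single best genome match, and the ambiguity it creates when closely related genomes score identically, can be sketched as follows. The function name, the use of a numeric score per hit and the alphabetical tie-break are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
def assign_species(hits):
    """Pick the genome with the single highest-scoring hit for one PMID.

    `hits` is a list of (genome, score) pairs. Returns (genome, tied), where
    `tied` lists any other genomes that reach the same top score."""
    if not hits:
        return None, []
    best_per_genome = {}
    for genome, score in hits:
        previous = best_per_genome.get(genome, float("-inf"))
        best_per_genome[genome] = max(score, previous)
    top = max(best_per_genome.values())
    winners = sorted(g for g, s in best_per_genome.items() if s == top)
    # Alphabetical order is an arbitrary tie-break; ties are reported explicitly.
    return winners[0], winners[1:]


# Hypothetical scores: identical mouse and rat hits produce a tie.
species, tied = assign_species([("mm8", 60.0), ("rn4", 60.0), ("hg18", 52.0)])
```
]]></preformat>
<p>Reporting the tied genomes, rather than silently keeping one, makes it possible to flag exactly the cases that the manual inspection above identified as the main source of wrong assignments.</p>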
<p>Gene name recognition and normalization to database identifiers is an essential step in many text mining applications, but is a challenging task because of ambiguity and variation in how genes are named and used [
<xref ref-type="bibr" rid="B33">33</xref>
]. The identity of the target gene regulated by a
<italic>cis</italic>
-regulatory sequence is a key piece of information in regulatory bioinformatics and is a required field in an ORegAnno annotation. Thus, we investigated whether it is possible to automatically identify the target gene of putative
<italic>cis</italic>
-regulatory sequences extracted from text and mapped to genomes. To do this we simply identified the closest Ensembl gene to each text hit that was mapped to one of the five genomes. In the case of text hits found in introns, the closest gene was predicted to be the gene containing the intron, even if additional genes were present within the intron that were closer to the text hit. Each hit for PMIDs that generated multiple genomic hits was assigned its own putative target gene and evaluated for whether any of the PMID-target gene relationships were found in ORegAnno. For this analysis, we used a set of 259 PMIDs with ORegAnno annotations that provided a best hit to one of the five genomes and for which one or more predicted target gene names were found in the set of Ensembl normalized GeneIds in ORegAnno. For 162 PMIDs, the list of closest genes matched the list of correct target genes perfectly, and for an additional 40 PMIDs there was a partial match between the list of putative target genes and the true list of target GeneIDs in ORegAnno. Overall, 78.0% of PMIDs generated at least one text hit whose closest gene was the correct target gene. In general, extracting sequences from text yields a higher proportion of correct target genes (87.5-89.6%) in the larger mammalian genomes where gene density is relatively low. In contrast, in the compact genomes of
<italic>D. melanogaster </italic>
and
<italic>Caenorhabditis elegans</italic>
, a lower proportion of target genes is correctly identified (63.6-68.0%) since a text hit can have a higher probability of being closer to a neighboring gene than its true target in a compact genome. Remarkably, our simple DNA sequence-based gene name recognition method achieves levels of PPV (precision) that are higher than the median performance in BioCreAtIvE Task 1B [
<xref ref-type="bibr" rid="B32">32</xref>
] of advanced gene name recognition systems for flies (65.9%) and mice (76.5%). Additionally, since each PMID with a text extracted hit leads to at least one predicted target gene, our sequence extraction method identifies gene names from full-text articles at a rate (26.0%) comparable to dictionary-based gene name recognition in Medline abstracts (19.4%) [
<xref ref-type="bibr" rid="B34">34</xref>
].</p>
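<p>The closest-gene rule, including the special case for hits inside introns, can be sketched as follows. This is an illustration under simplifying assumptions: genes are represented only by their start and end coordinates on one chromosome, a hit inside a gene span is treated as intronic, and the names and coordinates are hypothetical.</p>
<preformat><![CDATA[
```python
def closest_gene(hit, genes):
    """Predict the target gene for a text hit on one chromosome.

    `hit` is (start, end); `genes` is a list of (name, start, end).
    A hit lying inside a gene span is assigned to the outermost containing
    gene, even if a gene nested in the same intron lies closer."""
    hit_start, hit_end = hit
    containing = [g for g in genes if g[1] <= hit_start and hit_end <= g[2]]
    if containing:
        return max(containing, key=lambda g: g[2] - g[1])[0]

    def distance(gene):
        _, start, end = gene
        if end < hit_start:
            return hit_start - end
        if hit_end < start:
            return start - hit_end
        return 0  # partial overlap with the gene span

    return min(genes, key=distance)[0]


# Hypothetical gene models, for illustration only.
genes = [("geneA", 1000, 9000), ("nestedB", 4000, 4500), ("geneC", 12000, 15000)]
```
]]></preformat>
<p>In this toy layout a hit at 3,900-3,950 is assigned to geneA although nestedB is only 50 bases away, which is the behaviour described above for text hits found in introns.</p>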
</sec>
<sec>
<title>A draft annotation of more than 2,000 papers with high
<italic>cis</italic>
-regulatory content</title>
<p>Among the 10,587 papers not currently curated in ORegAnno in our set of 11,437 PMIDs with high
<italic>cis</italic>
-regulatory content, we obtained hits to 5,875 genomic regions from 2,680 PMIDs. If we assume that approximately 65% of text hits from these 'prediction' papers are true positives (based on the overall PPV estimates above), we expect that approximately 3,800 of these text hits correspond to
<italic>cis</italic>
-regulatory sequences. The addition of these records would increase the number of annotations curated from small-scale experiments in ORegAnno by approximately 120%. Indeed, many of these are likely to be
<italic>bona fide </italic>
regulatory sequences, as shown by the fact that 338 text hits from papers not currently curated overlap 722 pre-existing ORegAnno annotations. For example, PMIDs 6814763 [
<xref ref-type="bibr" rid="B35">35</xref>
] and 2370864 [
<xref ref-type="bibr" rid="B36">36</xref>
] (which were both identified as having high
<italic>cis</italic>
-regulatory content by our vector space model) each provided an extractable sequence that mapped to previously annotated
<italic>cis</italic>
-regulatory elements in the
<italic>Hsp70 </italic>
promoter (Figure
<xref ref-type="fig" rid="F5">5a</xref>
). This result suggests even the most highly curated genomes have yet to achieve 'saturation annotation' and that a high level of redundant publication may exist for some regulatory elements, which can be used to support or extend current ORegAnno annotations. These predictions are not sufficient to stand as full ORegAnno records on their own, but should substantially decrease the time needed for the community annotation of these papers. In addition, these regions may be of sufficient resolution to be used by other workers in regulatory bioinformatics, and for these reasons we provide browser extensible data (BED) files for text-extracted sequences from both evaluation and prediction papers for the
<italic>D. melanogaster </italic>
(Additional data file 1), human (Additional data file 2), mouse (Additional data file 3),
<italic>C. elegans </italic>
(Additional data file 4), and rat (Additional data file 5) genomes.</p>
</sec>
</sec>
<sec>
<title>Discussion</title>
<p>A principal aim of genome biology is to decode complete transcriptional networks, so as to better understand how the activation of specific subnetworks affects developmental processes or responses to the environment, and how variation in transcriptional networks can lead to functional diversity over evolutionary time. As with all grand challenges in interpreting genome sequences, achieving this ultimate aim will require combining both computational and experimental approaches. As the reliability of predictive regulatory sequence bioinformatics is relatively low [
<xref ref-type="bibr" rid="B37">37</xref>
], high-throughput experimental techniques currently prove to be the most efficient means of identifying regulatory sequences and assembling regulatory networks [
<xref ref-type="bibr" rid="B38">38</xref>
,
<xref ref-type="bibr" rid="B39">39</xref>
]. The gold standard for evaluating both computational and high-throughput experimental techniques continues to be the sizable body of prior knowledge contained in small-scale experimental studies on
<italic>cis</italic>
-regulatory sequences, much of which remains locked in the biomedical literature. Here we have shown that application of text-mining technologies, including literature management, information retrieval and information extraction systems, can accelerate the community annotation of
<italic>cis</italic>
-regulatory networks and sequences. These advances should help generate the necessary training and test sets to improve the reliability of computational and high-throughput experimental methods in regulatory biology.</p>
<p>Previously, it has been shown that manually curated and automatically extracted binary TF→TG interactions can be assembled into transcriptional regulatory networks [
<xref ref-type="bibr" rid="B6">6</xref>
-
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
]. Here we show that abstract relevance ranking using a vector space model can be used to enhance the manual annotation of binary TF→TG interactions, and should likewise further improve the automated extraction of binary TF→TG interactions to construct regulatory networks. We have also shown that the binary TF→TG interactions that are central to the construction of transcriptional regulatory networks can be extracted from text even when a full curation of the
<italic>cis</italic>
-regulatory sequence responsible for this interaction may not be possible. Our vector space model has also allowed us to generate an enhanced 'queue' of papers for annotation, and to gain a deeper insight into the size of the corpus of papers that may contain curatable
<italic>cis</italic>
-regulatory sequences, which we estimate is on the order of 30,000 papers or more. At a rate of approximately 1-2 hours of curation time per paper, it would take a single person approximately 15-30 years to curate and annotate this corpus manually. This estimate demonstrates the need for distributed community annotation systems and for computational tools that can assist in the extraction of relevant
<italic>cis</italic>
-regulatory information.</p>
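<p>The arithmetic behind this estimate can be sketched as follows. The figure of 2,000 working hours per year is not stated in the text; it is an assumption, chosen because it reproduces the 15-30 year range from 30,000 papers at 1-2 hours each. The function name is hypothetical.</p>
<preformat><![CDATA[
```python
def curation_years(n_papers, hours_per_paper, hours_per_year=2000):
    """Years needed for one curator to process n_papers.

    hours_per_year=2000 is an assumed full-time workload, not a value
    given in the article.
    """
    return n_papers * hours_per_paper / hours_per_year


if __name__ == "__main__":
    for hours in (1, 2):
        print(hours, "h/paper:", curation_years(30000, hours), "years")
```
]]></preformat>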
<p>We have also investigated the potential of exploiting information contained in the DNA sequences reported in papers with high
<italic>cis</italic>
-regulatory content to assist regulatory annotation. Given the large number of DNA, RNA and peptide sequences reported in the biomedical literature, and the fact that sequences important enough to deserve mention in publication are likely to be of high biological significance, surprisingly little work has been conducted on extracting sequences from primary text [
<xref ref-type="bibr" rid="B25">25</xref>
,
<xref ref-type="bibr" rid="B40">40</xref>
]. The pioneering work of Wren
<italic>et al. </italic>
[
<xref ref-type="bibr" rid="B25">25</xref>
] showed that Markov models trained on English text, proteins and/or genomic DNA can be used to extract both DNA and peptide sequences from abstracts and full text with high precision. Wren
<italic>et al. </italic>
[
<xref ref-type="bibr" rid="B25">25</xref>
] also demonstrated that the extraction of DNA is more precise than peptides, and that the terminological context of the majority of extracted DNA sequences revealed that the sequence was likely to be a 'regulatory site' or 'motif' [
<xref ref-type="bibr" rid="B25">25</xref>
]. Our results directly support the claim that primary text contains a large number of DNA strings that are
<italic>cis</italic>
-regulatory sequences, which we also show can be automatically mapped to genome sequences to accelerate and enhance regulatory annotation. In addition to validating our approach, overlaps between ORegAnno annotations and text-based hits can be used as an automatic procedure to authenticate ORegAnno annotations, which can be indicated in the 'Score' profile for each ORegAnno record. As identifying and annotating
<italic>cis</italic>
-regulatory sequences in genomes currently remain among the most challenging branches of bioinformatics, ironically it may now be easier and more productive to identify functional
<italic>cis</italic>
-regulatory sequences in biomedical text rather than in DNA itself.</p>
<p>Our rule-based system for extracting and mapping DNA sequences could potentially be improved in several ways. One area to explore would be to implement more sophisticated sequence recognition techniques such as Markov models [
<xref ref-type="bibr" rid="B25">25</xref>
], although our initial comparisons suggest very similar overall performance. Inclusion of lowercase letters or degeneracy in the DNA alphabet of our rule-based method may allow many more
<italic>cis</italic>
-regulatory motifs to be extracted, but may also allow many more DNA-like English words to be extracted. Aside from variation in formatting [
<xref ref-type="bibr" rid="B25">25</xref>
], DNA strings in text should be easily discernible from English words and, therefore, identifiable by many alternative methods, since the number of English words that can be spelled entirely in the DNA alphabet is small. For example, in a dictionary of approximately 355,000 English words [
<xref ref-type="bibr" rid="B41">41</xref>
], only 47 can be spelled entirely in DNA letters [ACGT], with an upper length of 7 characters for the word 'attacca,' a directive used at the end of a piece of music that is unlikely to be found in biomedical text. Inclusion of the entire set of ambiguity codes for DNA [ACGTMRWSYKVHDBXN] leads to a maximal English word size of only 13 characters for 'dharmashastra,' an ancient form of Indian jurisprudence. Thus, the vast majority of DNA-like strings of sufficient length to be mapped unambiguously to genomes are almost certainly
<italic>bona fide </italic>
DNA sequences. The main challenge for extracting DNA from text will be inaccuracies in the text encoding in older PDF documents, and the fact that many DNA sequences are embedded in tables, figures and supplementary materials. Although some figures have corresponding text encoded in the PDF, the use of text-recognition algorithms that operate on images would almost certainly improve the predictive power of our approach, and preliminary experiments have shown that this is the case (results not shown).</p>
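<p>The dictionary check described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the short word list stands in for the full dictionary (which is not reproduced here), and the counts reported in the text are therefore not recomputed.</p>
<preformat><![CDATA[
```python
DNA_ALPHABET = "ACGT"
IUPAC_ALPHABET = "ACGTMRWSYKVHDBXN"  # ambiguity codes as listed in the text


def dna_like_words(words, alphabet=DNA_ALPHABET):
    """Return the words that can be spelled using only letters of `alphabet`."""
    allowed = set(alphabet.upper())
    return [w for w in words if w and set(w.upper()).issubset(allowed)]


def longest(words):
    """Return the longest word, or None for an empty list."""
    return max(words, key=len) if words else None


if __name__ == "__main__":
    # Toy stand-in for a real dictionary file.
    sample = ["attacca", "tact", "dharmashastra", "promoter", "cat", "gag"]
    print(dna_like_words(sample))
    print(longest(dna_like_words(sample, IUPAC_ALPHABET)))
```
]]></preformat>
<p>Applied to a full word list, the same filter would give the number of DNA-like English words and their maximum length for either alphabet.</p>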
<p>The area with the largest scope for improvement in using DNA in text to annotate genomes is the mapping of sequences to genomes (Table
<xref ref-type="table" rid="T2">2</xref>
), in part because of the short length of many
<italic>cis</italic>
-regulatory sequences. One way to solve this problem would be to combine sequence extraction with term recognition [
<xref ref-type="bibr" rid="B25">25</xref>
] to identify species or target gene names that could be used to reduce the search space for mapping extracted sequences to genomes. Another improvement would be to accept mappings to multiple species, which is also a more realistic solution than the requirement for a single 'best' species since the biological function of a reported sequence is likely to be the same in closely related species. Improvements may also come from more lenient BLAST thresholds or the use of non-RepeatMasked versions of genomes, although these would almost certainly lead to higher false positive rates. Mapping regulatory sequences to repetitive genomic regions is a general problem, not only for text-extracted sequences, but also for manually curated data (Figure
<xref ref-type="fig" rid="F5">5a</xref>
). However, since many
<italic>cis</italic>
-regulatory elements may arise from transposable element sequences [
<xref ref-type="bibr" rid="B42">42</xref>
] or be located in segmental duplications (Figure
<xref ref-type="fig" rid="F5">5</xref>
), it will be necessary to solve the problem of representing and storing repetitive
<italic>cis</italic>
-regulatory elements for comprehensive regulatory annotation.</p>
<p>As presaged by Lincoln Stein [
<xref ref-type="bibr" rid="B1">1</xref>
], our results demonstrate that it is indeed possible to leverage text-mining technologies to accelerate genome annotation. Our proof of principle in the field of regulatory annotation is only one potential application of text-based genome sequence annotation. More generally, combining information retrieval systems (for example, [
<xref ref-type="bibr" rid="B19">19</xref>
]) with sequence extraction techniques (for example, [
<xref ref-type="bibr" rid="B25">25</xref>
]) should allow researchers to enrich for any specific sub-domain of biomedical research and use sequence data reported in these corpora to directly annotate genomic regions of interest in a highly automated fashion. For example, the false positive mappings that correspond to coding sequences in our set of documents with high
<italic>cis</italic>
-regulatory content (see above) are likely to be mainly for proteins that bind to
<italic>cis</italic>
-regulatory sequences, and thus strategies similar to ours could accelerate the labor-intensive identification of sequence-specific TFs [
<xref ref-type="bibr" rid="B43">43</xref>
,
<xref ref-type="bibr" rid="B44">44</xref>
]. Clearly, it is preferable that researchers deposit and store their sequences and annotations in databases as a condition for publication and thereby preclude the need for post-publication extraction of such valuable biological data. With established databases for general sequence submission (for example, [
<xref ref-type="bibr" rid="B45">45</xref>
]) and specialized
<italic>cis</italic>
-regulatory annotation [
<xref ref-type="bibr" rid="B4">4</xref>
,
<xref ref-type="bibr" rid="B5">5</xref>
], researchers now have the necessary tools to deposit and archive their
<italic>cis</italic>
-regulatory data. In the absence of direct database submission, we recommend that researchers report certain minimum information (that is, absolute coordinates with genome build, sequence with sufficient flank, standard gene identifiers, official species name or identifiers) to assist the regulatory annotation (both human and automated) that is needed to help catalyze advances in the field of gene regulation.</p>
</sec>
<sec sec-type="materials|methods">
<title>Materials and methods</title>
<sec>
<title>Implementation of a vector space model to identify Medline abstracts with high
<italic>cis</italic>
-regulatory content</title>
<p>To identify papers with potential
<italic>cis</italic>
-regulatory data for community annotation, we used a vector space model [
<xref ref-type="bibr" rid="B19">19</xref>
] that represents each of the approximately 16 million scientific abstracts in Medline as a vector of index terms. Each vector element is a weight that is proportional to the relative importance of the term in the abstract (using the inverse document frequency or IDF). Relevancy ranking of the corpus is then achieved by calculating the similarity between each abstract and a query. This query can be represented by the same kind of vector as the documents, so that the similarities can be calculated by the cosine similarity measure between individual abstract vectors and the composite query vector. In practice, a good query vector can be constructed from the average properties of a training set of true positive abstracts. In this study, we used a '
<italic>cis</italic>
-regulatory' PubMed query that yielded a very high proportion of true positives to generate our training set, namely: 'transcription and regulation and "binding site" and (promoter or enhancer)'.</p>
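<p>As a minimal sketch of the ranking procedure described above, and not the implementation used in this study, the following Python code builds IDF-weighted term vectors, averages the vectors of a training set into a composite query vector, and ranks abstracts by cosine similarity. The tokenizer, the exact weighting (term count multiplied by log IDF), all function names and the toy abstracts are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic index terms (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def idf_table(docs):
    """Inverse document frequency of every term in the corpus."""
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}


def to_vector(text, idf):
    """Sparse vector of term count times IDF; unseen terms are ignored."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Composite query vector: the average of the training-set vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {t: w / len(vectors) for t, w in total.items()}


def rank(corpus, training_docs):
    """Return (score, index) pairs for the corpus, best match first."""
    idf = idf_table(corpus)
    query = centroid([to_vector(d, idf) for d in training_docs])
    scored = [(cosine(to_vector(d, idf), query), i) for i, d in enumerate(corpus)]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    corpus = [
        "promoter binding site regulation of transcription by enhancer",
        "protein folding kinetics in yeast",
        "enhancer binding site drives transcription regulation",
        "clinical trial of aspirin in patients",
    ]
    for score, index in rank(corpus, training_docs=[corpus[0]]):
        print(round(score, 3), corpus[index])
```
]]></preformat>
<p>Averaging the training vectors means that terms shared by many true positive abstracts dominate the query, which is one simple way to realize the 'average properties of a training set' mentioned above.</p>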
</sec>
<sec>
<title>Pseudocuration of full-text articles</title>
<p>To evaluate the ability of our model to predict papers with high
<italic>cis</italic>
-regulatory content, we selected 344 papers from the top 100,000 scoring abstracts, of which 200 are uniformly distributed and 144 are related to the
<italic>Drosophila </italic>
transcription factor
<italic>eve</italic>
. Because the full curation of all 344 papers would require the organization of a second annotation jamboree, we opted for a distributed 'pseudocuration' procedure. Specifically, nine experienced curators examined whether these papers describe experimentally verified regulatory data and, if so, whether they also contain all the required data to allow genome annotation (that is, at a minimum the species, the sequence and its genomic location, the TF, and the TG). A web application was created where the curators could open a pending PMID and score the full-text paper as a success or failure. Failures could be of four types: the publication describes a binding site or promoter but there is insufficient information to annotate it; the publication describes a transcription factor (complex) but not a binding site or promoter; the publication describes consensus binding sites or a reference to a primary publication but is itself not the correct source for annotation; and the publication does not describe a regulatory element. Regulatory interactions in the form of TF→TG were recorded as free text.</p>
</sec>
<sec>
<title>Extraction of DNA sequences from full-text and mapping to genome sequences</title>
<p>A unique list of 11,437 PMIDs was compiled from papers previously curated in FlyReg [
<xref ref-type="bibr" rid="B11">11</xref>
], ORegAnno [
<xref ref-type="bibr" rid="B5">5</xref>
], TRANSFAC v10.4 [
<xref ref-type="bibr" rid="B3">3</xref>
], plus unannotated papers in the ORegAnno Publication Queue, and the top 4,501 scoring abstracts identified by the vector space model that are extremely likely to contain
<italic>cis</italic>
-regulatory data (see above). To allow access to information in both older and more recent articles, full-text was downloaded automatically as PDFs where available using a custom script employing NCBI eutils [
<xref ref-type="bibr" rid="B46">46</xref>
]. PDFs were converted to plain text using pdftotext (v3.0) with option '-nopgbrk' [
<xref ref-type="bibr" rid="B47">47</xref>
]. Text was split into words and words greater than 10 characters in length with greater than 40% of characters from the capitalized DNA alphabet [ACGT] were extracted using regular expressions to isolate putative DNA sequences. All putative DNA sequences extracted from each paper were concatenated in the order they appeared in the text into a single fasta sequence and labeled with the corresponding PMID. Concatenation of sequences was performed to merge sequences split by line breaks in the text conversion, and because we reasoned that inappropriate joins would be reconciled at the genome level by local alignment procedures. Extracted, concatenated sequences were used as queries to BLAST RepeatMasked versions of genome sequences downloaded from the UCSC genome database [
<xref ref-type="bibr" rid="B48">48</xref>
] for the five species with greater than 100 ORegAnno database annotations:
<italic>D. melanogaster </italic>
(dm2), human (hg18), mouse (mm8),
<italic>C. elegans </italic>
(ce2) and rat (rn4). We note that these five genomes represent approximately 99% of the records currently in ORegAnno. NCBI-BLASTN v2.2.10 [
<xref ref-type="bibr" rid="B49">49</xref>
] was used to map extracted sequences to genome coordinates with an E-value cutoff of 10e-5. BLAST output was parsed into BED format using Jim Kent's source tree utilities, blastToPsl and pslToBed [
<xref ref-type="bibr" rid="B50">50</xref>
]. BLAST results for all five species were concurrently searched to find the genome that provided the best sum of BLAST scores to each fasta sequence, and this list of PMID-best genome matches was used to filter BED files to minimize spurious cross-species mapping. We then joined fragmented hits in the same genomic interval by clustering BED annotations for the same PMID within 1.0 kb on the same chromosome. Filtered, clustered BED annotations were assessed for their overlap with the 20-JUL-2007 mapping of ORegAnno annotations [
<xref ref-type="bibr" rid="B51">51</xref>
] using the Kent source tree utilities overlapSelect and bedIntersect. Finally, we identified a single putative target gene for each hit as the Ensembl [
<xref ref-type="bibr" rid="B52">52</xref>
] GeneId closest to each filtered, clustered BED annotation.</p>
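<p>Three steps of the pipeline above (word-level extraction, best-genome selection and clustering of hits) can be sketched in Python as follows. This is a hedged illustration, not the scripts used in this study. All function names are hypothetical. Dropping non-ACGT characters before concatenation and measuring the 1.0 kb distance as the gap between the end of one hit and the start of the next are assumptions, since the text does not specify either detail. The BLAST search itself is not shown.</p>
<preformat><![CDATA[
```python
import re

DNA_CHARS = set("ACGT")


def putative_dna_words(text, min_len=10, min_fraction=0.40):
    """Words longer than min_len whose uppercase-ACGT fraction exceeds min_fraction."""
    hits = []
    for word in re.findall(r"\S+", text):
        if len(word) <= min_len:
            continue
        n_dna = sum(ch in DNA_CHARS for ch in word)
        if n_dna / len(word) > min_fraction:
            hits.append(word)
    return hits


def to_fasta(pmid, words):
    """Concatenate retained words, in text order, into one FASTA record.

    Assumption: non-ACGT characters are dropped before concatenation.
    """
    seq = "".join(ch for word in words for ch in word if ch in DNA_CHARS)
    return ">" + str(pmid) + "\n" + seq + "\n"


def best_genome(scores_by_genome):
    """Genome with the largest summed BLAST score for one query sequence."""
    return max(scores_by_genome, key=lambda g: sum(scores_by_genome[g]))


def cluster_bed(records, max_gap=1000):
    """Merge (chrom, start, end, pmid) records for the same PMID and chromosome
    when the gap to the previous record is at most max_gap bases (assumed rule)."""
    merged = []
    for chrom, start, end, pmid in sorted(records, key=lambda r: (r[3], r[0], r[1])):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[3] == pmid and start - last[2] <= max_gap:
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end, pmid])
    return [tuple(r) for r in merged]


if __name__ == "__main__":
    text = "The probe GGGAATTCCCGTACG binds the transcription start ACGTACGTACGT-3' site"
    print(to_fasta(12345, putative_dna_words(text)), end="")
```
]]></preformat>
<p>Because the 40% threshold only counts uppercase letters, ordinary lowercase English words such as 'transcription' are rejected regardless of their length.</p>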
</sec>
</sec>
<sec>
<title>Abbreviations</title>
<p>BED, browser extensible data; NCBI, National Center for Biotechnology Information; PMID, PubMed Identifier; PPV, positive predictive value; SVM, support vector machine; TF, transcription factor; TFBS, transcription factor binding site; TG, target gene; UCSC, University of California Santa Cruz.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>SA, MH, SvV and CMB conceived of the study and conducted the text mining experiments and analysis. SA, OLG, SJMJ, SBM and CMB designed and implemented the ORegAnno Publication Queue. SA, MH, OLG, PH, SJMJ, SBM, CMB and The Open Regulatory Annotation Consortium contributed to the curation activities of the RegCreative Jamboree. SA and CMB drafted the manuscript and all authors read and contributed to the final manuscript.</p>
</sec>
<sec>
<title>Additional data files</title>
<p>The following additional data files are available. Each additional data file is a UCSC genome BED formatted file that lists the chromosome, start coordinate, stop coordinate and PubMed identifier of text-extracted sequences on UCSC genome browser assemblies. Additional data file
<xref ref-type="supplementary-material" rid="S1">1</xref>
provides genomic coordinates of text hits to the dm2 version of the
<italic>D. melanogaster </italic>
genome. Additional data file
<xref ref-type="supplementary-material" rid="S2">2</xref>
provides genomic coordinates of text hits to the hg18 version of the human genome. Additional data file
<xref ref-type="supplementary-material" rid="S3">3</xref>
provides genomic coordinates of text hits to the mm8 version of the mouse genome. Additional data file
<xref ref-type="supplementary-material" rid="S4">4</xref>
provides genomic coordinates of text hits to the ce2 version of the
<italic>C. elegans </italic>
genome. Additional data file
<xref ref-type="supplementary-material" rid="S5">5</xref>
provides genomic coordinates of text hits to the rn4 version of the rat genome.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1">
<caption>
<title>Additional data file 1</title>
<p>The UCSC genome BED formatted file lists the chromosome, start coordinate, stop coordinate and PubMed identifier of text-extracted sequences on UCSC genome browser assemblies.</p>
</caption>
<media xlink:href="gb-2008-9-2-r31-S1.bed" mimetype="text" mime-subtype="plain">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S2">
<caption>
<title>Additional data file 2</title>
<p>The UCSC genome BED formatted file lists the chromosome, start coordinate, stop coordinate and PubMed identifier of text-extracted sequences on UCSC genome browser assemblies.</p>
</caption>
<media xlink:href="gb-2008-9-2-r31-S2.bed" mimetype="text" mime-subtype="plain">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S3">
<caption>
<title>Additional data file 3</title>
<p>The UCSC genome BED formatted file lists the chromosome, start coordinate, stop coordinate and PubMed identifier of text-extracted sequences on UCSC genome browser assemblies.</p>
</caption>
<media xlink:href="gb-2008-9-2-r31-S3.bed" mimetype="text" mime-subtype="plain">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S4">
<caption>
<title>Additional data file 4</title>
<p>The UCSC genome BED formatted file lists the chromosome, start coordinate, stop coordinate and PubMed identifier of text-extracted sequences on UCSC genome browser assemblies.</p>
</caption>
<media xlink:href="gb-2008-9-2-r31-S4.bed" mimetype="text" mime-subtype="plain">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="S5">
<caption>
<title>Additional data file 5</title>
<p>The UCSC genome BED formatted file lists the chromosome, start coordinate, stop coordinate and PubMed identifier of text-extracted sequences on UCSC genome browser assemblies.</p>
</caption>
<media xlink:href="gb-2008-9-2-r31-S5.bed" mimetype="text" mime-subtype="plain">
<caption>
<p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<sec>
<title>Acknowledgements</title>
<p>We thank Jonathan Wren for help running his Markov sequence extraction method as well as all of the participants of the RegCreative Jamboree for many fruitful discussions before, during and after the Jamboree. We are especially grateful to Martin Krallinger, Lynette Hirschman, Alfonso Valencia and Ewan Birney for encouraging links between the regulatory informatics and text-mining communities. SA is Postdoctoral Research Fellow of the FWO-Vlaanderen; MH is supported by a Marie Curie Early Stage Research Training Fellowship (MEST-CT-2004-504854) and the Plurigenes STREP project (LSHG-CT-2005-018673); OLG is supported by the Canadian Institutes of Health Research and the Michael Smith Foundation for Health Research; SBM is supported by the European Molecular Biology Organization and the Natural Sciences and Engineering Research Council of Canada. We also thank ENFIN, the BioSapiens Network, the Research Foundation - Flanders (FWO-Vlaanderen), Genome Canada and Genome British Columbia for financial support of the RegCreative Jamboree. This work is conducted as part of the NESCent
<italic>cis</italic>
-regulatory evolution working group supported by the NSF National Evolutionary Synthesis Center (NSF #EF-0423641).</p>
</sec>
</ack>
<ref-list>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stein</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Genome annotation: from sequence to biology.</article-title>
<source>Nat Rev Genet</source>
<year>2001</year>
<volume>2</volume>
<fpage>493</fpage>
<lpage>503</lpage>
<pub-id pub-id-type="pmid">11433356</pub-id>
<pub-id pub-id-type="doi">10.1038/35080529</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elsik</surname>
<given-names>CG</given-names>
</name>
<name>
<surname>Worley</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Milshina</surname>
<given-names>NV</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Reese</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Childs</surname>
<given-names>KL</given-names>
</name>
<name>
<surname>Venkatraman</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Dickens</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Weinstock</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Gibbs</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>Community annotation: procedures, protocols, and supporting tools.</article-title>
<source>Genome Res</source>
<year>2006</year>
<volume>16</volume>
<fpage>1329</fpage>
<lpage>1333</lpage>
<pub-id pub-id-type="pmid">17065605</pub-id>
<pub-id pub-id-type="doi">10.1101/gr.5580606</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matys</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Kel-Margoulis</surname>
<given-names>OV</given-names>
</name>
<name>
<surname>Fricke</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Liebich</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Land</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Barre-Dirrie</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Reuter</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Chekmenev</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Krull</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hornischer</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Voss</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Stegmaier</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Lewicki-Potapov</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Saxel</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Kel</surname>
<given-names>AE</given-names>
</name>
<name>
<surname>Wingender</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<issue>Database issue</issue>
<fpage>D108</fpage>
<lpage>D110</lpage>
<pub-id pub-id-type="pmid">16381825</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkj143</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Portales-Casamar</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Kirov</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lim</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lithwick</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Swanson</surname>
<given-names>MI</given-names>
</name>
<name>
<surname>Ticoll</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Snoddy</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wasserman</surname>
<given-names>WW</given-names>
</name>
</person-group>
<article-title>PAZAR: a framework for collection and dissemination of
<italic>cis</italic>
-regulatory sequence annotation.</article-title>
<source>Genome Biol</source>
<year>2007</year>
<volume>8</volume>
<fpage>R207</fpage>
<pub-id pub-id-type="pmid">17916232</pub-id>
<pub-id pub-id-type="doi">10.1186/gb-2007-8-10-r207</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Montgomery</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Griffith</surname>
<given-names>OL</given-names>
</name>
<name>
<surname>Sleumer</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Bilenky</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pleasance</surname>
<given-names>ED</given-names>
</name>
<name>
<surname>Prychyna</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<article-title>ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation.</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>637</fpage>
<lpage>640</lpage>
<pub-id pub-id-type="pmid">16397004</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btk027</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saric</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Rojas</surname>
<given-names>I</given-names>
</name>
</person-group>
<article-title>Large-scale extraction of gene regulation for model organisms in an ontological context.</article-title>
<source>In Silico Biol</source>
<year>2005</year>
<volume>5</volume>
<fpage>21</fpage>
<lpage>32</lpage>
<pub-id pub-id-type="pmid">15972005</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Saric</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Ouzounova</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Rojas</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Bork</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Extraction of regulatory gene/protein networks from Medline.</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>645</fpage>
<lpage>650</lpage>
<pub-id pub-id-type="pmid">16046493</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti597</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodriguez-Penagos</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Salgado</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Martinez-Flores</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Collado-Vides</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Automatic reconstruction of a bacterial regulatory network using Natural Language Processing.</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>293</fpage>
<pub-id pub-id-type="pmid">17683642</pub-id>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-293</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="other">
<article-title>The RegCreative Jamboree</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.dmbr.ugent.be/bioit/contents/regcreative/"></ext-link>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Griffith</surname>
<given-names>OL</given-names>
</name>
<name>
<surname>Montgomery</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Bernier</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Kasaian</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Aerts</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mahony</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sleumer</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Bilenky</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Haeussler</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Griffith</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gallo</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Giardine</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Hooghe</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Van Loo</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Blanco</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Ticoll</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lithwick</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Portales-Casamar</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Donaldson</surname>
<given-names>IJ</given-names>
</name>
<name>
<surname>Robertson</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wadelius</surname>
<given-names>C</given-names>
</name>
<name>
<surname>De Bleser</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Vlieghe</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Halfon</surname>
<given-names>MS</given-names>
</name>
<name>
<surname>Wasserman</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Hardison</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
<collab>The Open Regulatory Annotation Consortium</collab>
</person-group>
<article-title>ORegAnno: an open-access community-driven resource for regulatory annotation.</article-title>
<source>Nucleic Acids Res</source>
<year>2008</year>
<volume>36</volume>
<issue>Database issue</issue>
<fpage>D107</fpage>
<lpage>D113</lpage>
<pub-id pub-id-type="pmid">18006570</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkm967</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bergman</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Carlson</surname>
<given-names>JW</given-names>
</name>
<name>
<surname>Celniker</surname>
<given-names>SE</given-names>
</name>
</person-group>
<article-title>
<italic>Drosophila </italic>
DNase I footprint database: a systematic genome annotation of transcription factor binding sites in the fruitfly,
<italic>Drosophila melanogaster</italic>
.</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>1747</fpage>
<lpage>1749</lpage>
<pub-id pub-id-type="pmid">15572468</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti173</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gallo</surname>
<given-names>SM</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Halfon</surname>
<given-names>MS</given-names>
</name>
</person-group>
<article-title>REDfly: a Regulatory Element Database for
<italic>Drosophila</italic>
.</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>381</fpage>
<lpage>383</lpage>
<pub-id pub-id-type="pmid">16303794</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti794</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wasserman</surname>
<given-names>WW</given-names>
</name>
<name>
<surname>Fickett</surname>
<given-names>JW</given-names>
</name>
</person-group>
<article-title>Identification of regulatory regions which confer muscle-specific gene expression.</article-title>
<source>J Mol Biol</source>
<year>1998</year>
<volume>278</volume>
<fpage>167</fpage>
<lpage>181</lpage>
<pub-id pub-id-type="pmid">9571041</pub-id>
<pub-id pub-id-type="doi">10.1006/jmbi.1998.1700</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ho Sui</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Mortimer</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Arenillas</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Brumm</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Walsh</surname>
<given-names>CJ</given-names>
</name>
<name>
<surname>Kennedy</surname>
<given-names>BP</given-names>
</name>
<name>
<surname>Wasserman</surname>
<given-names>WW</given-names>
</name>
</person-group>
<article-title>oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes.</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>33</volume>
<fpage>3154</fpage>
<lpage>3164</lpage>
<pub-id pub-id-type="pmid">15933209</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gki624</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blanco</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Farré</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Albà</surname>
<given-names>MM</given-names>
</name>
<name>
<surname>Messeguer</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Guigó</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>ABS: a database of annotated regulatory binding sites from orthologous promoters.</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<issue>Database issue</issue>
<fpage>D63</fpage>
<lpage>D67</lpage>
<pub-id pub-id-type="pmid">16381947</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkj116</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Xuan</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>MQ</given-names>
</name>
</person-group>
<article-title>TRED: a Transcriptional Regulatory Element Database and a platform for
<italic>in silico </italic>
gene regulation studies.</article-title>
<source>Nucleic Acids Res</source>
<year>2005</year>
<volume>33</volume>
<issue>Database issue</issue>
<fpage>D103</fpage>
<lpage>D107</lpage>
<pub-id pub-id-type="pmid">15608156</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gki004</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ghosh</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Object-oriented Transcription Factors Database (ooTFD).</article-title>
<source>Nucleic Acids Res</source>
<year>2000</year>
<volume>28</volume>
<fpage>308</fpage>
<lpage>310</lpage>
<pub-id pub-id-type="pmid">10592257</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/28.1.308</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sierro</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Kusakabe</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>K-J</given-names>
</name>
<name>
<surname>Yamashita</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Kinoshita</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Nakai</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>DBTGR: a database of tunicate promoters and their regulatory elements.</article-title>
<source>Nucleic Acids Res</source>
<year>2006</year>
<volume>34</volume>
<issue>Database issue</issue>
<fpage>D552</fpage>
<lpage>D555</lpage>
<pub-id pub-id-type="pmid">16381930</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkj064</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glenisson</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Coessens</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Van Vooren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Mathys</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Moreau</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>TXTGate: profiling gene groups with text-based information.</article-title>
<source>Genome Biol</source>
<year>2004</year>
<volume>5</volume>
<fpage>R43</fpage>
<pub-id pub-id-type="pmid">15186494</pub-id>
<pub-id pub-id-type="doi">10.1186/gb-2004-5-6-r43</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Joachims</surname>
<given-names>T</given-names>
</name>
</person-group>
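<person-group person-group-type="editor">
<name>
<surname>Schölkopf</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Burges</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Smola</surname>
<given-names>A</given-names>
</name>
</person-group>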
<article-title>Making large-scale support vector machine learning practical.</article-title>
<source>Advances in Kernel Methods: Support Vector Learning</source>
<year>1999</year>
<publisher-name>MIT Press</publisher-name>
<fpage>169</fpage>
<lpage>184</lpage>
</citation>
</ref>
<ref id="B21">
<citation citation-type="other">
<article-title>SVM Light</article-title>
<ext-link ext-link-type="uri" xlink:href="http://svmlight.joachims.org/"></ext-link>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Crosby</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Goodman</surname>
<given-names>JL</given-names>
</name>
<name>
<surname>Strelets</surname>
<given-names>VB</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Gelbart</surname>
<given-names>WM</given-names>
</name>
<collab>The FlyBase Consortium</collab>
</person-group>
<article-title>FlyBase: genomes by the dozen.</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>Database issue</issue>
<fpage>D486</fpage>
<lpage>D491</lpage>
<pub-id pub-id-type="pmid">17099233</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkl827</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shannon</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Markiel</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ozier</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Baliga</surname>
<given-names>NS</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Ramage</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Amin</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Schwikowski</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Ideker</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Cytoscape: a software environment for integrated models of biomolecular interaction networks.</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<fpage>2498</fpage>
<lpage>2504</lpage>
<pub-id pub-id-type="pmid">14597658</pub-id>
<pub-id pub-id-type="doi">10.1101/gr.1239303</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ashburner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>CM</given-names>
</name>
</person-group>
<article-title>
<italic>Drosophila melanogaster</italic>
: a case study of a model genomic sequence and its consequences.</article-title>
<source>Genome Res</source>
<year>2005</year>
<volume>15</volume>
<fpage>1661</fpage>
<lpage>1667</lpage>
<pub-id pub-id-type="pmid">16339363</pub-id>
<pub-id pub-id-type="doi">10.1101/gr.3726705</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wren</surname>
<given-names>JD</given-names>
</name>
<name>
<surname>Hildebrand</surname>
<given-names>WH</given-names>
</name>
<name>
<surname>Chandrasekaran</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Melcher</surname>
<given-names>U</given-names>
</name>
</person-group>
<article-title>Markov model recognition and classification of DNA/protein sequences within large text databases.</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>4046</fpage>
<lpage>4053</lpage>
<pub-id pub-id-type="pmid">16159926</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/bti657</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Trinklein</surname>
<given-names>ND</given-names>
</name>
<name>
<surname>Aldred</surname>
<given-names>SJ</given-names>
</name>
<name>
<surname>Saldanha</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>RM</given-names>
</name>
</person-group>
<article-title>Identification and functional analysis of human transcriptional promoters.</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<fpage>308</fpage>
<lpage>312</lpage>
<pub-id pub-id-type="pmid">12566409</pub-id>
<pub-id pub-id-type="doi">10.1101/gr.794803</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pennacchio</surname>
<given-names>LA</given-names>
</name>
<name>
<surname>Ahituv</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Moses</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Prabhakar</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Nobrega</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Shoukry</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Minovitsky</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Dubchak</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Holt</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Plajzer-Frick</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Akiyama</surname>
<given-names>J</given-names>
</name>
<name>
<surname>De Val</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Afzal</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Black</surname>
<given-names>BL</given-names>
</name>
<name>
<surname>Couronne</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Eisen</surname>
<given-names>MB</given-names>
</name>
<name>
<surname>Visel</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>EM</given-names>
</name>
</person-group>
<article-title>
<italic>In vivo </italic>
enhancer analysis of human conserved non-coding sequences.</article-title>
<source>Nature</source>
<year>2006</year>
<volume>444</volume>
<fpage>499</fpage>
<lpage>502</lpage>
<pub-id pub-id-type="pmid">17086198</pub-id>
<pub-id pub-id-type="doi">10.1038/nature05295</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robertson</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bainbridge</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bilenky</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Euskirchen</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bernier</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Varhol</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Delaney</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Thiessen</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Griffith</surname>
<given-names>OL</given-names>
</name>
<name>
<surname>He</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Marra</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Snyder</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.</article-title>
<source>Nat Methods</source>
<year>2007</year>
<volume>4</volume>
<fpage>651</fpage>
<lpage>657</lpage>
<pub-id pub-id-type="pmid">17558387</pub-id>
<pub-id pub-id-type="doi">10.1038/nmeth1068</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pierstorff</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bergman</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Wiehe</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Identifying
<italic>cis</italic>
-regulatory modules by combining comparative and compositional analysis of DNA.</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>2858</fpage>
<lpage>2864</lpage>
<pub-id pub-id-type="pmid">17032682</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl499</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lis</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Prestidge</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Hogness</surname>
<given-names>DS</given-names>
</name>
</person-group>
<article-title>A novel arrangement of tandemly repeated genes at a major heat shock site in
<italic>D. melanogaster</italic>
.</article-title>
<source>Cell</source>
<year>1978</year>
<volume>14</volume>
<fpage>901</fpage>
<lpage>919</lpage>
<pub-id pub-id-type="pmid">99245</pub-id>
<pub-id pub-id-type="doi">10.1016/0092-8674(78)90345-8</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Livak</surname>
<given-names>KJ</given-names>
</name>
<name>
<surname>Freund</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Schweber</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Wensink</surname>
<given-names>PC</given-names>
</name>
<name>
<surname>Meselson</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Sequence organization and transcription at two heat shock loci in
<italic>Drosophila</italic>
.</article-title>
<source>Proc Natl Acad Sci USA</source>
<year>1978</year>
<volume>75</volume>
<fpage>5613</fpage>
<lpage>5617</lpage>
<pub-id pub-id-type="pmid">103099</pub-id>
<pub-id pub-id-type="doi">10.1073/pnas.75.11.5613</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hirschman</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Colosimo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Morgan</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Yeh</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Overview of BioCreAtIvE task 1B: normalized gene lists.</article-title>
<source>BMC Bioinformatics</source>
<year>2005</year>
<volume>6</volume>
<issue>Suppl 1</issue>
<fpage>S11</fpage>
<pub-id pub-id-type="pmid">15960823</pub-id>
<pub-id pub-id-type="doi">10.1186/1471-2105-6-S1-S11</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Leser</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Hakenberg</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>What makes a gene name? Named entity recognition in the biomedical literature.</article-title>
<source>Brief Bioinform</source>
<year>2005</year>
<volume>6</volume>
<fpage>357</fpage>
<lpage>369</lpage>
<pub-id pub-id-type="pmid">16420734</pub-id>
<pub-id pub-id-type="doi">10.1093/bib/6.4.357</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jenssen</surname>
<given-names>TK</given-names>
</name>
<name>
<surname>Laegreid</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Komorowski</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Hovig</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>A literature network of human genes for high-throughput analysis of gene expression.</article-title>
<source>Nat Genet</source>
<year>2001</year>
<volume>28</volume>
<fpage>21</fpage>
<lpage>28</lpage>
<pub-id pub-id-type="pmid">11326270</pub-id>
<pub-id pub-id-type="doi">10.1038/88213</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pelham</surname>
<given-names>HR</given-names>
</name>
</person-group>
<article-title>A regulatory upstream promoter element in the
<italic>Drosophila hsp 70 </italic>
heat-shock gene.</article-title>
<source>Cell</source>
<year>1982</year>
<volume>30</volume>
<fpage>517</fpage>
<lpage>528</lpage>
<pub-id pub-id-type="pmid">6814763</pub-id>
<pub-id pub-id-type="doi">10.1016/0092-8674(82)90249-5</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gilmour</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Dietz</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Elgin</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>UV cross-linking identifies four polypeptides that require the TATA box to bind to the
<italic>Drosophila hsp70 </italic>
promoter.</article-title>
<source>Mol Cell Biol</source>
<year>1990</year>
<volume>10</volume>
<fpage>4233</fpage>
<lpage>4238</lpage>
<pub-id pub-id-type="pmid">2370864</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tompa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Bailey</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>De Moor</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Eskin</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Favorov</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Frith</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
<name>
<surname>Makeev</surname>
<given-names>VJ</given-names>
</name>
<name>
<surname>Mironov</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Noble</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Pavesi</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Pesole</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Regnier</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Simonis</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Thijs</surname>
<given-names>G</given-names>
</name>
<name>
<surname>van Helden</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Vandenbogaert</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Weng</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Workman</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Assessing computational tools for the discovery of transcription factor binding sites.</article-title>
<source>Nat Biotechnol</source>
<year>2005</year>
<volume>23</volume>
<fpage>137</fpage>
<lpage>144</lpage>
<pub-id pub-id-type="pmid">15637633</pub-id>
<pub-id pub-id-type="doi">10.1038/nbt1053</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Harbison</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>TI</given-names>
</name>
<name>
<surname>Rinaldi</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Macisaac</surname>
<given-names>KD</given-names>
</name>
<name>
<surname>Danford</surname>
<given-names>TW</given-names>
</name>
<name>
<surname>Hannett</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Tagne</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Reynolds</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Yoo</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jennings</surname>
<given-names>EG</given-names>
</name>
<name>
<surname>Zeitlinger</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Pokholok</surname>
<given-names>DK</given-names>
</name>
<name>
<surname>Kellis</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rolfe</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Takusagawa</surname>
<given-names>KT</given-names>
</name>
<name>
<surname>Lander</surname>
<given-names>ES</given-names>
</name>
<name>
<surname>Gifford</surname>
<given-names>DK</given-names>
</name>
<name>
<surname>Fraenkel</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>Transcriptional regulatory code of a eukaryotic genome.</article-title>
<source>Nature</source>
<year>2004</year>
<volume>431</volume>
<fpage>99</fpage>
<lpage>104</lpage>
<pub-id pub-id-type="pmid">15343339</pub-id>
<pub-id pub-id-type="doi">10.1038/nature02800</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>TI</given-names>
</name>
<name>
<surname>Rinaldi</surname>
<given-names>NJ</given-names>
</name>
<name>
<surname>Robert</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Odom</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Bar-Joseph</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Gerber</surname>
<given-names>GK</given-names>
</name>
<name>
<surname>Hannett</surname>
<given-names>NM</given-names>
</name>
<name>
<surname>Harbison</surname>
<given-names>CT</given-names>
</name>
<name>
<surname>Thompson</surname>
<given-names>CM</given-names>
</name>
<name>
<surname>Simon</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Zeitlinger</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Jennings</surname>
<given-names>EG</given-names>
</name>
<name>
<surname>Murray</surname>
<given-names>HL</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>DB</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Wyrick</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Tagne</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Volkert</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Fraenkel</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Gifford</surname>
<given-names>DK</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>Transcriptional regulatory networks in
<italic>Saccharomyces cerevisiae</italic>
.</article-title>
<source>Science</source>
<year>2002</year>
<volume>298</volume>
<fpage>799</fpage>
<lpage>804</lpage>
<pub-id pub-id-type="pmid">12399584</pub-id>
<pub-id pub-id-type="doi">10.1126/science.1075090</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shtatland</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Guettler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kossodo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pivovarov</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Weissleder</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>PepBank - a database of peptides based on sequence text mining and public peptide data sources.</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>280</fpage>
<pub-id pub-id-type="pmid">17678535</pub-id>
<pub-id pub-id-type="doi">10.1186/1471-2105-8-280</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="other">
<article-title>The GNU Collaborative International Dictionary of English</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.ibiblio.org/webster/"></ext-link>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jordan</surname>
<given-names>IK</given-names>
</name>
<name>
<surname>Rogozin</surname>
<given-names>IB</given-names>
</name>
<name>
<surname>Glazko</surname>
<given-names>GV</given-names>
</name>
<name>
<surname>Koonin</surname>
<given-names>EV</given-names>
</name>
</person-group>
<article-title>Origin of a substantial fraction of human regulatory sequences from transposable elements.</article-title>
<source>Trends Genet</source>
<year>2003</year>
<volume>19</volume>
<fpage>68</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="pmid">12547512</pub-id>
<pub-id pub-id-type="doi">10.1016/S0168-9525(02)00006-9</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adryan</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Teichmann</surname>
<given-names>SA</given-names>
</name>
</person-group>
<article-title>FlyTF: a systematic review of site-specific transcription factors in the fruit fly
<italic>Drosophila melanogaster</italic>
.</article-title>
<source>Bioinformatics</source>
<year>2006</year>
<volume>22</volume>
<fpage>1532</fpage>
<lpage>1533</lpage>
<pub-id pub-id-type="pmid">16613907</pub-id>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btl143</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reece-Hoyes</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Deplancke</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Shingles</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Grove</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Hope</surname>
<given-names>IA</given-names>
</name>
<name>
<surname>Walhout</surname>
<given-names>AJ</given-names>
</name>
</person-group>
<article-title>A compendium of
<italic>Caenorhabditis elegans </italic>
regulatory transcription factors: a resource for mapping transcription regulatory networks.</article-title>
<source>Genome Biol</source>
<year>2005</year>
<volume>6</volume>
<fpage>R110</fpage>
<pub-id pub-id-type="pmid">16420670</pub-id>
<pub-id pub-id-type="doi">10.1186/gb-2005-6-13-r110</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benson</surname>
<given-names>DA</given-names>
</name>
<name>
<surname>Karsch-Mizrachi</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Ostell</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wheeler</surname>
<given-names>DL</given-names>
</name>
</person-group>
<article-title>GenBank.</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>Database issue</issue>
<fpage>D21</fpage>
<lpage>D25</lpage>
<pub-id pub-id-type="pmid">17202161</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkl986</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="other">
<article-title>Entrez Programming Utilities</article-title>
<ext-link ext-link-type="uri" xlink:href="http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html"></ext-link>
</citation>
</ref>
<ref id="B47">
<citation citation-type="other">
<article-title>pdftotext</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.foolabs.com/xpdf/"></ext-link>
</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuhn</surname>
<given-names>RM</given-names>
</name>
<name>
<surname>Karolchik</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Zweig</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Trumbower</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Thomas</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Thakkapallayil</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Sugnet</surname>
<given-names>CW</given-names>
</name>
<name>
<surname>Stanke</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>KE</given-names>
</name>
<name>
<surname>Siepel</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rosenbloom</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Rhead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Raney</surname>
<given-names>BJ</given-names>
</name>
<name>
<surname>Pohl</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Pedersen</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hinrichs</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>Harte</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Diekhans</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Clawson</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Bejerano</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Barber</surname>
<given-names>GP</given-names>
</name>
<name>
<surname>Baertsch</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kent</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>The UCSC genome browser database: update 2007.</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>Database issue</issue>
<fpage>D668</fpage>
<lpage>D673</lpage>
<pub-id pub-id-type="pmid">17142222</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkl928</pub-id>
</citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>SF</given-names>
</name>
<name>
<surname>Madden</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Schaffer</surname>
<given-names>AA</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</article-title>
<source>Nucleic Acids Res</source>
<year>1997</year>
<volume>25</volume>
<fpage>3389</fpage>
<lpage>3402</lpage>
<pub-id pub-id-type="pmid">9254694</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/25.17.3389</pub-id>
</citation>
</ref>
<ref id="B50">
<citation citation-type="other">
<article-title>Kent Source Tree</article-title>
<ext-link ext-link-type="uri" xlink:href="http://genome.ucsc.edu/google/admin/cvs.html"></ext-link>
</citation>
</ref>
<ref id="B51">
<citation citation-type="other">
<article-title>ORegAnno Wiki</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.bcgsc.ca/wiki/display/oreganno/DataFiles"></ext-link>
</citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hubbard</surname>
<given-names>TJ</given-names>
</name>
<name>
<surname>Aken</surname>
<given-names>BL</given-names>
</name>
<name>
<surname>Beal</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ballester</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Caccamo</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Clarke</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Coates</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Cutts</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Down</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Dyer</surname>
<given-names>SC</given-names>
</name>
<name>
<surname>Fitzgerald</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Fernandez-Banet</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Graf</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Haider</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hammond</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Herrero</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Holland</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Howe</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Howe</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Kahari</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Keefe</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Kokocinski</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Kulesha</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lawson</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Longden</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Melsopp</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Megy</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ensembl 2007.</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<issue>Database issue</issue>
<fpage>D610</fpage>
<lpage>D617</lpage>
<pub-id pub-id-type="pmid">17148474</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkl996</pub-id>
</citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Belgique/explor/OpenAccessBelV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000239 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000239 | SxmlIndent | more
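
The commands above write the indented XML record to standard output. As a minimal sketch of what could be done with that output (this is not part of Dilib), the Python function below lists the PubMed and DOI identifiers attached to each `<ref>` element. It assumes the input is well-formed XML in which any namespace prefixes (such as `xlink:`) are either declared or absent; the function name `list_reference_ids` and the `SAMPLE` fragment are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def list_reference_ids(xml_text):
    """Return a list of (ref_id, pmid, doi) tuples, one per <ref> element.

    Identifiers missing from a reference are reported as None.
    """
    root = ET.fromstring(xml_text)
    results = []
    for ref in root.iter("ref"):
        # Map each pub-id-type attribute (e.g. "pmid", "doi") to its text.
        ids = {
            pub.get("pub-id-type"): (pub.text or "").strip()
            for pub in ref.iter("pub-id")
        }
        results.append((ref.get("id"), ids.get("pmid"), ids.get("doi")))
    return results


# Hypothetical fragment modelled on two entries of the reference list.
SAMPLE = """<ref-list>
<ref id="B45"><citation citation-type="journal">
<pub-id pub-id-type="pmid">17202161</pub-id>
<pub-id pub-id-type="doi">10.1093/nar/gkl986</pub-id>
</citation></ref>
<ref id="B46"><citation citation-type="other">
<article-title>Entrez Programming Utilities</article-title>
</citation></ref>
</ref-list>"""

if __name__ == "__main__":
    for ref_id, pmid, doi in list_reference_ids(SAMPLE):
        print(ref_id, pmid, doi)
```

References of type "other", such as web resources, carry no `pub-id` elements, so the sketch reports None for both identifiers rather than skipping the entry.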

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Belgique
   |area=    OpenAccessBelV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:2374703
   |texte=   Text-mining assisted regulatory annotation
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:18271954" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a OpenAccessBelV2 


This area was generated with Dilib version V0.6.25.
Data generation: Thu Dec 1 00:43:49 2016. Site generation: Wed Mar 6 14:51:30 2024