OCR exploration server

Annotating genes and genomes with DNA sequences extracted from biomedical articles

Authors: Maximilian Haeussler; Martin Gerner; Casey M. Bergman

RBID : PMC:3065681

Abstract

Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.

Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments.

Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.

Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org.

Contact: maximilianh@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.


DOI: 10.1093/bioinformatics/btr043
PubMed: 21325301
PubMed Central: 3065681

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Annotating genes and genomes with DNA sequences extracted from biomedical articles</title>
<author>
<name sortKey="Haeussler, Maximilian" sort="Haeussler, Maximilian" uniqKey="Haeussler M" first="Maximilian" last="Haeussler">Maximilian Haeussler</name>
</author>
<author>
<name sortKey="Gerner, Martin" sort="Gerner, Martin" uniqKey="Gerner M" first="Martin" last="Gerner">Martin Gerner</name>
</author>
<author>
<name sortKey="Bergman, Casey M" sort="Bergman, Casey M" uniqKey="Bergman C" first="Casey M." last="Bergman">Casey M. Bergman</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">21325301</idno>
<idno type="pmc">3065681</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3065681</idno>
<idno type="RBID">PMC:3065681</idno>
<idno type="doi">10.1093/bioinformatics/btr043</idno>
<date when="2011">2011</date>
<idno type="wicri:Area/Pmc/Corpus">000157</idno>
<idno type="wicri:Area/Pmc/Curation">000157</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Annotating genes and genomes with DNA sequences extracted from biomedical articles</title>
<author>
<name sortKey="Haeussler, Maximilian" sort="Haeussler, Maximilian" uniqKey="Haeussler M" first="Maximilian" last="Haeussler">Maximilian Haeussler</name>
</author>
<author>
<name sortKey="Gerner, Martin" sort="Gerner, Martin" uniqKey="Gerner M" first="Martin" last="Gerner">Martin Gerner</name>
</author>
<author>
<name sortKey="Bergman, Casey M" sort="Bergman, Casey M" uniqKey="Bergman C" first="Casey M." last="Bergman">Casey M. Bergman</name>
</author>
</analytic>
<series>
<title level="j">Bioinformatics</title>
<idno type="ISSN">1367-4803</idno>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>
<bold>Motivation:</bold>
Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.</p>
<p>
<bold>Results:</bold>
Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments.</p>
<p>
<bold>Conclusion:</bold>
Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.</p>
<p>
<bold>Availability and implementation:</bold>
Source code is available under a BSD license from
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/text2genome/">http://sourceforge.net/projects/text2genome/</ext-link>
and results can be browsed and downloaded at
<ext-link ext-link-type="uri" xlink:href="http://text2genome.org">http://text2genome.org</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>maximilianh@gmail.com</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Aerts, S" uniqKey="Aerts S">S. Aerts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Anderson, N R" uniqKey="Anderson N">N.R. Anderson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benson, D A" uniqKey="Benson D">D.A. Benson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cock, P J" uniqKey="Cock P">P.J. Cock</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Colosimo, M E" uniqKey="Colosimo M">M.E. Colosimo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dowell, R D" uniqKey="Dowell R">R.D. Dowell</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fulp, C T" uniqKey="Fulp C">C.T. Fulp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Garcia Remesal, M" uniqKey="Garcia Remesal M">M. Garcia-Remesal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Garcia Remesal, M" uniqKey="Garcia Remesal M">M. Garcia-Remesal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gerner, M" uniqKey="Gerner M">M. Gerner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gibson, U E" uniqKey="Gibson U">U.E. Gibson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gray, P W" uniqKey="Gray P">P.W. Gray</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hakenberg, J" uniqKey="Hakenberg J">J. Hakenberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Holley, R W" uniqKey="Holley R">R.W. Holley</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hubbard, T J" uniqKey="Hubbard T">T.J. Hubbard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karolchik, D" uniqKey="Karolchik D">D. Karolchik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kent, W J" uniqKey="Kent W">W.J. Kent</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kersey, P J" uniqKey="Kersey P">P.J. Kersey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krallinger, M" uniqKey="Krallinger M">M. Krallinger</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Maglott, D" uniqKey="Maglott D">D. Maglott</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Morgan, A A" uniqKey="Morgan A">A.A. Morgan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rhead, B" uniqKey="Rhead B">B. Rhead</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, R J" uniqKey="Roberts R">R.J. Roberts</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Semon, D" uniqKey="Semon D">D. Semon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shtatland, T" uniqKey="Shtatland T">T. Shtatland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vandesompele, J" uniqKey="Vandesompele J">J. Vandesompele</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Visel, A" uniqKey="Visel A">A. Visel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weiss, M S" uniqKey="Weiss M">M. S. Weiss</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wren, J D" uniqKey="Wren J">J.D. Wren</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yoshida, Y" uniqKey="Yoshida Y">Y. Yoshida</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Bioinformatics</journal-id>
<journal-id journal-id-type="publisher-id">bioinformatics</journal-id>
<journal-id journal-id-type="hwp">bioinfo</journal-id>
<journal-title-group>
<journal-title>Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="ppub">1367-4803</issn>
<issn pub-type="epub">1367-4811</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">21325301</article-id>
<article-id pub-id-type="pmc">3065681</article-id>
<article-id pub-id-type="doi">10.1093/bioinformatics/btr043</article-id>
<article-id pub-id-type="publisher-id">btr043</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Papers</subject>
<subj-group>
<subject>Data and Text Mining</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Annotating genes and genomes with DNA sequences extracted from biomedical articles</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Haeussler</surname>
<given-names>Maximilian</given-names>
</name>
<xref ref-type="corresp" rid="COR1">*</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gerner</surname>
<given-names>Martin</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bergman</surname>
<given-names>Casey M.</given-names>
</name>
</contrib>
</contrib-group>
<aff>Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK</aff>
<author-notes>
<corresp id="COR1">*To whom correspondence should be addressed.</corresp>
<fn>
<p>Associate Editor: Alex Bateman</p>
</fn>
</author-notes>
<pub-date pub-type="ppub">
<day>1</day>
<month>4</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="epub">
<day>16</day>
<month>2</month>
<year>2011</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>16</day>
<month>2</month>
<year>2011</year>
</pub-date>
<pmc-comment> PMC Release delay is 0 months and 0 days and was based on the . </pmc-comment>
<volume>27</volume>
<issue>7</issue>
<fpage>980</fpage>
<lpage>986</lpage>
<history>
<date date-type="received">
<day>15</day>
<month>12</month>
<year>2010</year>
</date>
<date date-type="rev-recd">
<day>21</day>
<month>1</month>
<year>2011</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>1</month>
<year>2011</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2011. Published by Oxford University Press.</copyright-statement>
<copyright-year>2011</copyright-year>
<license license-type="creative-commons" xlink:href="http://creativecommons.org/licenses/by-nc/2.5">
<license-p>
<pmc-comment>CREATIVE COMMONS</pmc-comment>
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/2.5">http://creativecommons.org/licenses/by-nc/2.5</ext-link>
), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract>
<p>
<bold>Motivation:</bold>
Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.</p>
<p>
<bold>Results:</bold>
Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments.</p>
<p>
<bold>Conclusion:</bold>
Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.</p>
<p>
<bold>Availability and implementation:</bold>
Source code is available under a BSD license from
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/text2genome/">http://sourceforge.net/projects/text2genome/</ext-link>
and results can be browsed and downloaded at
<ext-link ext-link-type="uri" xlink:href="http://text2genome.org">http://text2genome.org</ext-link>
.</p>
<p>
<bold>Contact:</bold>
<email>maximilianh@gmail.com</email>
</p>
<p>
<bold>Supplementary information:</bold>
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary data</ext-link>
are available at
<italic>Bioinformatics</italic>
online.</p>
</abstract>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="SEC1">
<title>1 INTRODUCTION</title>
<p>A common challenge encountered by many biomedical researchers is to obtain a summary of the relevant literature pertaining to a particular gene or genomic region. With nearly 2000 articles added to MEDLINE on a daily basis (
<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/bsd/index_stats_comp.html">http://www.nlm.nih.gov/bsd/index_stats_comp.html</ext-link>
), it is increasingly difficult to keep up with the rapid pace of publication outside one's immediate domain of expertise. The problem of finding relevant articles for a particular locus is becoming more acute as researchers increasingly adopt high-throughput genomic technologies (microarrays, genome-wide association studies, high-throughput sequencing, etc.). These genome-wide approaches often generate low-level data on thousands of genes or genomic regions, the interpretation of which becomes much more valuable when integrated with previously published studies on individual loci.</p>
<p>The challenge of linking articles to genes is partially solved for a limited number of model organisms, where dedicated teams of curators scan the literature and link publications to gene records in individual model organism databases such as FlyBase (
<xref ref-type="bibr" rid="B7">The FlyBase Consortium, 2003</xref>
), or through federated multi-organism databases such as Entrez Gene (
<xref ref-type="bibr" rid="B21">Maglott
<italic>et al.</italic>
, 2007</xref>
). However, these collections are not comprehensive and for the majority of species, including human, efforts to curate gene–article associations remain incomplete. In principle, automatic linking of articles to genes could be achieved by developing text-mining tools that detect gene names or identifiers in abstracts or full-text articles. However, gene names are not used consistently and are often not unique, and developing accurate methods to resolve and disambiguate gene names in text and link them to database identifiers remains an active area of research (
<xref ref-type="bibr" rid="B20">Krallinger
<italic>et al.</italic>
, 2008</xref>
).</p>
<p>Even with curated or automatically generated links between articles and genes, the exact genomic sequences referred to in an article currently can only be determined by human interpretation of the full text. Furthermore, specific questions such as ‘which transcript was cloned?’, ‘which exon was amplified?’ or ‘where in the genome is a particular mutation found?’ can take considerable time for an individual researcher to answer, often requiring labor-intensive manual interaction between the literature and genomic databases.</p>
<p>For publications where authors report DNA sequences directly, these problems could be solved if all published sequences were systematically sent to primary sequence databases such as GenBank (
<xref ref-type="bibr" rid="B3">Benson
<italic>et al.</italic>
, 2010</xref>
). However, in the post-genomic era fewer articles report primary DNA sequences directly and instead only report primers used for polymerase chain reaction (PCR)-based techniques that are designed from published genome sequences. Furthermore, in contrast to longer DNA sequences, journals generally do not require deposition of short primer sequences in databases, and the minimum sequence length required for a GenBank submission is 50 bp (
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html">http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html</ext-link>
). As such, many DNA sequences that could provide unique tags to link articles to specific genes and genomes remain locked in the biological literature.</p>
<p>The possibility that DNA and protein sequences can be extracted from biomedical text was first demonstrated by
<xref ref-type="bibr" rid="B30">Wren
<italic>et al.</italic>
(2005</xref>
) and subsequently by several other groups (
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
, 2008</xref>
;
<xref ref-type="bibr" rid="B9">Garcia-Remesal
<italic>et al.</italic>
, 2010a</xref>
,
<xref ref-type="bibr" rid="B10">b</xref>
;
<xref ref-type="bibr" rid="B26">Shtatland
<italic>et al.</italic>
, 2007</xref>
).
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
(2008</xref>
) extended this technique to show that DNA sequences extracted from biomedical text could be mapped to genome sequences to identify the location, organism and target gene mentioned in an article. The approach of
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
(2008</xref>
) was inspired by, and tailored to, longer sequences typically found in publications on
<italic>cis</italic>
-regulatory regions. For this particular use case, genome mapping using the single best BLAST match on a small number of model organism genomes provided high-precision results. However, this basic approach is not suitable for the more ambitious application of mapping all sequences from articles to all genomes, since the short size of many sequences in articles (e.g. PCR primers) and the increasing size of genome databases require more sophisticated mapping techniques. In addition,
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
(2008</xref>
) did not provide software for users to run and extend or a database for users to download and browse results in an intuitive way.</p>
<p>Here, we address the question of whether annotation of all genomes with sequences from the available open access (OA) biomedical literature is a realizable and practical goal. We show that the automated extraction and mapping of DNA sequences from more than 150 000 OA full-text articles in PubMed Central (PMC) is indeed possible and present a software implementation, called ‘text2genome’, that achieves this aim. We map extracted sequences to 224 genomes and provide easily searchable results in the form of both a web interface and genome annotation tracks for the Ensembl (
<xref ref-type="bibr" rid="B16">Hubbard
<italic>et al.</italic>
, 2009</xref>
) and UCSC genome browsers (
<xref ref-type="bibr" rid="B23">Rhead
<italic>et al.</italic>
, 2010</xref>
). We demonstrate that we can associate articles with relevant genes and genomes by evaluating text2genome results on the subset of articles that also have GenBank records. Finally, we provide example use cases to demonstrate potential applications of our approach, by intersecting text2genome mappings with ChIP-seq data, and by querying for articles that report quantitative reverse transcriptase (RT)-PCR experiments for a given genomic locus. Our work provides a unique and timely resource for interpreting both biomedical literature and genomic data, and will aid discovery across many domains of the life sciences.</p>
</sec>
<sec id="SEC2">
<title>2 SYSTEM AND METHODS</title>
<sec id="SEC2.1">
<title>2.1 Data sources</title>
<p>PMC-OA (
<xref ref-type="bibr" rid="B24">Roberts, 2001</xref>
) full-text articles were downloaded in June 2010. PMCID–PMID associations were obtained from the PMC FTP server (
<ext-link ext-link-type="ftp" xlink:href="ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz">ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz</ext-link>
). When available, we used XML files, or the ASCII text version of the PMC-OA article based on optical character recognition (OCR) of the original PDF. For PMC-OA articles where neither XML nor text files were available, pdftotext (
<ext-link ext-link-type="uri" xlink:href="http://www.foolabs.com/xpdf/">http://www.foolabs.com/xpdf/</ext-link>
) was used to convert the PDF file to ASCII text. Additionally, we processed text (or converted text) from supplementary files of the following document types: HTML, CSV, TXT, XML, DOC, XLS, PPT and PDF.</p>
<p>Starting with a complete download of GenBank version 176, we kept only non-high-throughput divisions that are relevant to this study: bct, inv, mam, pln, pri, rod, vrt. The resulting dataset of 7.09 million records was parsed with BioPython (
<xref ref-type="bibr" rid="B4">Cock
<italic>et al.</italic>
, 2009</xref>
) into relational database tables. Sequences > 1 Mb were eliminated to remove large sequences from high-throughput studies that were deposited in the incorrect division of GenBank (e.g. chromosome sequences of
<italic>Drosophila melanogaster</italic>
). Entrez Gene data were downloaded in October 2009.</p>
<p>RepeatMasked genomes and gene transcripts were obtained from Ensembl Release 56 (
<xref ref-type="bibr" rid="B16">Hubbard
<italic>et al.</italic>
, 2009</xref>
) and EnsemblGenomes Release 3 (
<xref ref-type="bibr" rid="B19">Kersey
<italic>et al.</italic>
, 2010</xref>
). The taxa represented in these genomes include 68 animals, 134 bacteria, 10 fungi, 8 plants and 4 protists, totalling 224 organisms with an NCBI taxon ID.</p>
</sec>
<sec id="SEC2.2">
<title>2.2 Text and sequence processing algorithm</title>
<p>We developed a simple procedure to detect nucleotide sequences in articles that accounts for the presence of OCR errors and the fact that sequences can be separated by spaces and line breaks (see
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 1</ext-link>
for details). In brief, we first removed all non-letter mark-up characters from a text, then concatenated words that (i) contained exclusively nucleotide letters or (ii) contained a certain percentage of nucleotide letters (
<italic>a</italic>
,
<italic>c</italic>
,
<italic>t</italic>
,
<italic>g</italic>
and
<italic>u</italic>
) above a length cut-off of 19. FASTA sequences extracted from PMC-OA were then searched with BLAT (
<xref ref-type="bibr" rid="B18">Kent, 2002</xref>
) in genes and genomes from Ensembl/EnsemblGenomes with a minimum number of 19 identical base pairs. Articles with exceedingly long (> 1 Mb) or many (> 100) sequences were removed from further processing to increase the precision of our approach, since some supplementary files contain genome-wide sequence data (e.g. microarray probes).</p>
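<p>The detection step described above can be sketched as follows. This is a minimal illustration and not the text2genome code: the function name <italic>extract_sequences</italic> and the parameter <italic>min_len</italic> are hypothetical, only rule (i) (words made up exclusively of nucleotide letters) is implemented, and the length cut-off is treated as inclusive, which is an assumption.</p>
<preformat><![CDATA[
```python
import re

# Letters accepted as nucleotides, as listed in the text (a, c, t, g and u).
NUCLEOTIDES = set("acgtu")


def extract_sequences(text, min_len=19):
    """Return candidate nucleotide sequences found in free text.

    Hypothetical helper: implements only rule (i) from the text, i.e. words
    that consist exclusively of nucleotide letters are concatenated.
    """
    # Replace every non-letter character by a space, so that digits and
    # punctuation (for example the 5'- and -3' primer labels) split words.
    words = re.sub(r"[^A-Za-z]+", " ", text).lower().split()

    sequences, run = [], []
    # The empty string is a sentinel that flushes the last run of words.
    for word in words + [""]:
        if word and set(word) <= NUCLEOTIDES:
            run.append(word)
            continue
        candidate = "".join(run)
        if len(candidate) >= min_len:  # assumed inclusive cut-off
            sequences.append(candidate)
        run = []
    return sequences
```
]]></preformat>
<p>Short English words such as ‘a’ or ‘tag’ also consist only of nucleotide letters, which is why a length cut-off is needed: runs of such words rarely reach 19 letters and are discarded.</p>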
<p>The resulting BLAT matches from genomes and transcripts were then filtered to obtain the best matching species, genes and genomic regions. For each extracted sequence, only the highest scoring hits were retained. Hits to common plasmids and sequencing vectors were removed (using data from NCBI Univec). In order to disambiguate sequences matching several different organisms equally well, we extracted all mentions of organism names from the articles using default settings of LINNAEUS (
<xref ref-type="bibr" rid="B11">Gerner
<italic>et al.</italic>
, 2010</xref>
). If the full text contained organism names detected by LINNAEUS, only BLAT matches for these genomes were kept. If no organism mentions were found, the matches were limited to human and major model organisms. If there was no best match among these genomes, all remaining matches were retained. The best genome was determined as the one with the highest number of matching sequences at the gene or genomic level. To account for conserved sequences that may hit highly similar genomes (e.g. chimpanzee and human), the best genome for species that had the same number of best hits was decided by ranking genomes based on the species with the higher number of publications in Entrez Gene.</p>
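<p>The choice of the best genome, including the tie-break on publication counts, can be illustrated with a short sketch. The function name <italic>best_genome</italic> and both arguments are hypothetical; the input is assumed to be one genome label per best-scoring hit, and the publication counts stand in for the Entrez Gene figures mentioned above.</p>
<preformat><![CDATA[
```python
from collections import Counter


def best_genome(hit_genomes, publication_counts):
    """Pick the genome with the most matching sequences.

    hit_genomes: iterable with one genome label per best-scoring hit.
    publication_counts: dict mapping genome label to a publication count,
    used only to break ties (assumed stand-in for Entrez Gene counts).
    """
    counts = Counter(hit_genomes)
    if not counts:
        return None  # no hits, so no prediction
    # Rank by number of hits first, then by publication count.
    return max(counts, key=lambda g: (counts[g], publication_counts.get(g, 0)))
```
]]></preformat>
<p>With two hits each on human and chimpanzee, for instance, the genome with the larger publication count is returned.</p>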
<p>Hits on the best genome were fused into ‘chains’ if they were located closer than 50 kb; hits on transcripts from the best genome were chained if they matched the same gene. When a sequence was a member of several chains (e.g. caused by matches to segmental duplications), the hit was retained only for the chain with the maximum number of other matching sequences. Genes were predicted to be hit only if they matched at least two text-extracted sequences. If two genes passed this threshold and were hit by exactly the same sequences, only the gene with the largest number of publications in Entrez Gene was retained.</p>
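<p>The chaining of genomic hits can be sketched as below, under stated assumptions: hits are represented as (chromosome, start, end) records, the distance between hits is measured from the end of the current chain to the start of the next hit, and the names <italic>Hit</italic>, <italic>chain_hits</italic> and <italic>max_gap</italic> are hypothetical. The resolution of sequences that belong to several chains is not shown.</p>
<preformat><![CDATA[
```python
from collections import namedtuple

# Hypothetical record for one BLAT hit on the best genome.
Hit = namedtuple("Hit", "chrom start end")


def chain_hits(hits, max_gap=50_000):
    """Group hits on the same chromosome that lie closer than max_gap."""
    chains = []
    chain_end = None  # right-most coordinate of the chain being built
    for hit in sorted(hits):
        same_chrom = chains and chains[-1][0].chrom == hit.chrom
        if same_chrom and hit.start - chain_end < max_gap:
            chains[-1].append(hit)
            chain_end = max(chain_end, hit.end)
        else:
            chains.append([hit])
            chain_end = hit.end
    return chains
```
]]></preformat>
<p>Sorting first means each hit only has to be compared with the chain that is currently open, so the grouping needs a single pass over the hits.</p>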
<p>In general, our filtering steps are designed to achieve high precision, which can result in no prediction for either genes or genomic features. For instance, genomic features but no genes are predicted if sequences map to non-coding regulatory DNA (promoters and enhancers). Conversely, genes but no genomic features are predicted if sequences are designed to span exon–exon boundaries such as morpholinos or primers for quantitative RT-PCR.</p>
</sec>
<sec id="SEC2.3">
<title>2.3 Data analysis</title>
<p>GenBank records were used to generate a benchmark set of links between articles and species or genes. The PubMed document ID and organism of a submission were parsed directly from GenBank records using the chronologically first article associated with the record (i.e. the last or second-to-last ‘REFERENCE’ entry). GenBank accession numbers in full-text articles were identified using the following set of regular expressions: (([A-Z]{1}[0-9]{5})|([A-Z]{2}[0-9]{6})|([A-Z]{4}[0-9]{8,9})|([A-Z]{5}[0-9]{7}))(\[0-9]{1,3}). Precision was defined as the number of species–article or gene–article predictions by text2genome that matched at least one species–article or gene–article association defined by the GenBank record, as a proportion of the total number of text2genome predictions. Recall was defined as the number of species–article or gene–article predictions by text2genome that matched the species–article or gene–article associations defined by the GenBank record, as a proportion of the total number of associations defined in the sample of GenBank records.</p>
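<p>A small sketch of the accession matching and of the two evaluation measures is given here for illustration. The four accession formats follow the expression quoted above; the optional ‘.version’ suffix and the word boundaries are assumptions, because the final group of the published expression is not fully legible. Associations are modelled as sets of (article, gene or species) pairs, and all function names are hypothetical.</p>
<preformat><![CDATA[
```python
import re

# Four accession formats from the text. The optional version suffix and the
# \b word boundaries are assumptions, not part of the quoted expression.
ACCESSION = re.compile(
    r"\b(?:[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}|[A-Z]{4}[0-9]{8,9}|[A-Z]{5}[0-9]{7})"
    r"(?:\.[0-9]{1,3})?\b"
)


def find_accessions(text):
    """Return all accession-like strings found in the text."""
    return ACCESSION.findall(text)


def precision_recall(predicted, benchmark):
    """Compare predicted and benchmark (article, target) pairs."""
    predicted, benchmark = set(predicted), set(benchmark)
    matched = len(predicted & benchmark)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(benchmark) if benchmark else 0.0
    return precision, recall
```
]]></preformat>
<p>For example, three correct predictions out of four, measured against six benchmark associations, give a precision of 0.75 and a recall of 0.5.</p>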
<p>To obtain the most likely Ensembl gene identifier for each GenBank record, BLAT was used to map each GenBank sequence to Ensembl/EnsemblGenomes transcripts, keeping only the best matching gene ID. As non-high-throughput divisions still can contain submissions with several thousand sequences, we filtered this set to retain sequences from small-scale analyses only and therefore removed articles that submitted more than 100 sequences. The resulting table contains articles identified by their PubMed ID and the genomes and predicted genes that were submitted to GenBank, represented by their NCBI Taxonomy ID and the Ensembl Gene ID (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 2</ext-link>
).</p>
<p>ChIP-seq data from
<xref ref-type="bibr" rid="B28">Visel
<italic>et al.</italic>
(2009</xref>
) was obtained from NCBI GEO Accession GSM348064. p300-bound peaks were mapped to the most current mouse genome assembly (mm9) and loaded into the UCSC genome browser (
<xref ref-type="bibr" rid="B23">Rhead
<italic>et al.</italic>
, 2010</xref>
) as a custom track in combination with the text2genome mm9 genome annotation track. ChIP-seq data was intersected with the full-text extracted sequences using the ‘overlap’ function of the UCSC table browser.</p>
<p>We retrieved articles describing RT-PCR-related experiments by using NCBI Entrez Programming Utilities to query PubMed abstracts or PMC full-text articles with the following query: ‘qpcr’ OR ‘q-pcr’ OR ‘qrt-pcr’ OR ‘quantitative pcr’ OR ‘quantitative poly*’ OR ‘quantitative polymerase’ OR ‘quantitative realtime pcr’ OR ‘reverse transcription polymerase’ OR ‘reverse transcription pcr’ OR ‘rtpcr’ OR ‘rt-pcr’ OR ‘rt-qpcr’ OR ‘rtq-pcr’. A list of common RT-PCR control loci was obtained from
<xref ref-type="bibr" rid="B27">Vandesompele
<italic>et al.</italic>
(2002</xref>
). From this list, only the prefixes were used to account for different gene names in non-mammalian model organisms. Genes were thereby counted as RT-PCR control genes if they started with one of the following prefixes: ACT, B2M, GAPD, HMBS, HPRT1, RPL13, SDHA, TBP, UBC and YWHAZ.</p>
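<p>The prefix test for control genes is simple enough to state in a few lines. The sketch below uses the ten prefixes listed above; the function name <italic>is_control_gene</italic> and the case-insensitive comparison are assumptions.</p>
<preformat><![CDATA[
```python
# Prefixes of common RT-PCR control genes, as listed in the text.
CONTROL_PREFIXES = (
    "ACT", "B2M", "GAPD", "HMBS", "HPRT1",
    "RPL13", "SDHA", "TBP", "UBC", "YWHAZ",
)


def is_control_gene(symbol):
    """True if the gene symbol starts with one of the control prefixes.

    Upper-casing the symbol first is an assumption, made so that symbols
    such as mouse 'Actb' are also recognised.
    """
    return symbol.upper().startswith(CONTROL_PREFIXES)
```
]]></preformat>
<p>Prefix matching is deliberately permissive: it accepts GAPDH and Actb, but it also accepts any other symbol that happens to share a prefix, such as ACTN1.</p>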
</sec>
<sec id="SEC2.4">
<title>2.4 Implementation</title>
<p>All extracted sequences, BLAT matches and genome–gene associations generated by text2genome are stored in a MySQL database. Custom Python CGI scripts render data into HTML pages and act as a light-weight Distributed Annotation System (DAS;
<xref ref-type="bibr" rid="B6">Dowell
<italic>et al.</italic>
, 2001</xref>
) server, making it possible to overlay the matches onto the Ensembl Genome Browser and provide metadata including links to the corresponding articles. An additional script exports the same data in Browser Extensible Data (BED) format, allowing visualization and filtering of chained BLAT matches on the UCSC genome browser. Extracted sequences, predicted genes, browser tracks and additional metadata, such as gene names recognized in full-text articles by GNAT (
<xref ref-type="bibr" rid="B14">Hakenberg
<italic>et al.</italic>
, 2008</xref>
), can be searched, downloaded and browsed at
<ext-link ext-link-type="uri" xlink:href="http://www.text2genome.org">http://www.text2genome.org</ext-link>
. Source code for the text extraction, mapping and display is available as a set of Python 2.4 scripts that can be downloaded from
<ext-link ext-link-type="uri" xlink:href="http://sourceforge.net/projects/text2genome/">http://sourceforge.net/projects/text2genome/</ext-link>
.</p>
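<p>As an illustration of the BED export step, the sketch below formats mapped sequences as six-column BED lines. It is not the export script distributed with text2genome: the function names, the tuple layout of a hit and the track name are assumptions, and chained matches with several blocks would need the twelve-column BED variant, which is not shown.</p>
<preformat>
```python
def hit_to_bed(chrom, start, end, article_id, score=0, strand="+"):
    """Format one mapped sequence as a six-column BED line.

    BED coordinates are zero-based and half-open: start is inclusive,
    end is exclusive.
    """
    if start > end or 0 > start:
        raise ValueError("invalid interval: %r-%r" % (start, end))
    fields = [chrom, str(start), str(end), article_id, str(score), strand]
    return "\t".join(fields)


def write_bed(hits, handle, track_name="text2genome_sketch"):
    """Write a custom-track header followed by one BED line per hit."""
    handle.write('track name="%s"\n' % track_name)
    for hit in hits:
        handle.write(hit_to_bed(*hit) + "\n")
```
</preformat>
<p>Each hit is passed as a tuple such as (chromosome, start, end, article identifier), with score and strand optional.</p>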
</sec>
</sec>
<sec sec-type="results" id="SEC3">
<title>3 RESULTS</title>
<sec id="SEC3.1">
<title>3.1 Full-text articles contain a wealth of DNA sequences that are not in GenBank</title>
<p>We extracted DNA sequences from 153 513 full-text research articles and their associated supplementary files in the OA subset of PMC (downloaded June 2010; hereafter referred to as PMC-OA) using a procedure similar to the one that we presented previously in
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
(2008</xref>
). Briefly, text was first stripped of XML tags and non-letter characters, after which all words with an [A,C,T,G,U] content greater than a threshold value were concatenated and output if the resulting sequence exceeded a length cutoff (see
<xref ref-type="sec" rid="SEC2">Section 2</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 1</ext-link>
for details). Using this algorithm, we obtained 350 888 nucleotide-like strings with an average length of 115.81 bp and a total size of 40.6 Mb (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 3</ext-link>
). The mode of the length distribution of individual sequences is 20 bp, and sequences more frequently occur in multiples of 2, suggesting that many sequences are PCR primers (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 4</ext-link>
).</p>
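<p>The extraction procedure summarized above can be sketched as follows. This is a simplified reading of the description, not the published algorithm: the threshold of 0.9, the length cutoff of 17 and all names are assumptions chosen for the example (the actual values are given in Supplementary File 1), and the handling of short English words that consist only of nucleotide letters is deliberately naive.</p>
<preformat>
```python
import re

# \x3c and \x3e are the opening and closing angle brackets of an XML tag.
TAG_RE = re.compile(r"\x3c[^\x3e]*\x3e")
NON_LETTER_RE = re.compile(r"[^A-Za-z]+")


def nucleotide_fraction(word):
    """Fraction of letters in word that are A, C, G, T or U."""
    return sum(ch in "ACGTU" for ch in word.upper()) / len(word)


def extract_dna_like_strings(xml_text, min_fraction=0.9, min_length=17):
    """Return nucleotide-like strings found in a marked-up text.

    Tags and non-letter characters are replaced by spaces, consecutive
    words whose nucleotide content is greater than min_fraction are
    concatenated, and a run is kept if it is longer than min_length.
    """
    text = TAG_RE.sub(" ", xml_text)
    words = NON_LETTER_RE.sub(" ", text).split()
    sequences, run = [], []
    for word in words + [""]:  # the empty sentinel flushes the final run
        if word and nucleotide_fraction(word) > min_fraction:
            run.append(word.upper())
            continue
        candidate = "".join(run)
        if len(candidate) > min_length:
            sequences.append(candidate)
        run = []
    return sequences
```
</preformat>
<p>In this sketch a primer printed in two blocks, such as ACGTACGTAC GTACGTACGT, is recovered as a single 20-letter string, whereas an isolated word like "cat" qualifies as nucleotide-like but is discarded by the length cutoff.</p>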
<p>In total, 22.6% (34 828/153 513) of all research articles in PMC-OA contain sequence-like strings. The proportion of articles containing sequences in PMC-OA reached a peak of ∼45% in the mid-1990s and has subsequently levelled off at just over 20% (
<xref ref-type="fig" rid="F1">Fig. 1</xref>
). Over 33% (119 281/350 888) of sequences and the majority of nucleotides (64%, 26.0/40.6 Mb) were extracted from supplementary files. Only 9.7% (3381/34 828) of these articles contain sequences exclusively in their supplementary files. The majority of extracted strings are likely to be
<italic>bona fide</italic>
DNA sequences, since out of 3443 articles published before 1960 [i.e. prior to the advent of nucleic acid sequencing (
<xref ref-type="bibr" rid="B15">Holley
<italic>et al.</italic>
, 1965</xref>
)], only one article contains a sequence-like string (which was caused by an OCR error). To further validate our method, we manually inspected a randomly selected 1% subset of the nucleotide strings and found that all were valid DNA sequences, implying that most extracted strings represent true DNA sequences.
<fig id="F1" position="float">
<label>Fig. 1.</label>
<caption>
<p>Trends in sequence extraction from full-text articles. Lines show the growth of the PMC-OA subset from 1970 through 2009, together with the number and proportion of articles that contain DNA sequences in their full text or supplementary files, and the number of articles that could be linked to non-high-throughput GenBank submissions.</p>
</caption>
<graphic xlink:href="btr043f1"></graphic>
</fig>
</p>
<p>To compare the proportion of articles with DNA sequences to the proportion of articles accompanied by a GenBank submission, we estimated the number of PMC-OA articles that have a non-high-throughput GenBank record. Overall, we found that 6.7% (10 378/153 513) of PMC-OA articles are linkable to a GenBank submission (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 5</ext-link>
), and that this number of articles has remained relatively low over time (
<xref ref-type="fig" rid="F1">Fig. 1</xref>
). As expected, we can extract nucleotides from the full text of the majority of PMC-OA articles with a GenBank submission (76.5%, 7937/10 378). Surprisingly, 77.2% (26 891/34 828) of articles with extractable nucleotides in their full text are not linkable to a GenBank submission. This result implies that the majority of sequences extracted from full-text articles have not been submitted to any nucleotide data bank.</p>
</sec>
<sec id="SEC3.2">
<title>3.2 Sequences link articles to species, genes and genomic locations with high precision</title>
<p>DNA sequences extracted from text do not themselves contain information about the species or gene from which they were obtained, but instead must be mapped to other annotated sequences in order to propagate this meta-information to the articles in which they were found. Therefore, we searched all sequences extracted from text against all genome and transcript sequences in the Ensembl and EnsemblGenomes databases and resolved the best matching species, gene and genomic region (see
<xref ref-type="sec" rid="SEC2">Section 2</xref>
and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 1</ext-link>
for details). An example of a genome browser view of the text2genome mappings for a region of the mouse genome containing the tumor necrosis factor (Tnf) gene is shown in
<xref ref-type="fig" rid="F2">Figure 2</xref>
. Overall, 79.3% of articles with sequences (27 632/34 828) lead to a single best species prediction that is based on hits to a best matching genomic region (99.3%, 27 452/27 632) and/or a best matching gene (36.0%, 9935/27 632). Roughly one-third of all article–species associations are based on both a best matching genomic region and a best matching gene hit (35.3%, 9755/27 632), indicating that many sequences extracted from articles map to intronic and intergenic regions. In total, these articles generate 247 007 unique associations between articles and genomic regions (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 6</ext-link>
) and 23 388 unique gene–article associations (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 7</ext-link>
). The ∼21% of articles with sequences that do not lead to a best genome prediction at all arise from sequences in repeated regions, from cloning/sequencing vectors, or species not currently represented in Ensembl or EnsemblGenomes.
<fig id="F2" position="float">
<label>Fig. 2.</label>
<caption>
<p>Example of text2genome mappings for the mouse Tnf region. Exons for RefSeq gene models of Tnf and lymphotoxin A (Lta) are shown as grey rectangles below, and chained BLAT hits from text2genome mappings are shown as black rectangles above. Note that the majority of mapped papers contain pairs of sequences that are consistent with being PCR primers. The two larger mapped sequences come from original publications reporting Tnf and Lta primary sequences (
<xref ref-type="bibr" rid="B13">Gray
<italic>et al.</italic>
, 1987</xref>
;
<xref ref-type="bibr" rid="B25">Semon
<italic>et al.</italic>
, 1987</xref>
).</p>
</caption>
<graphic xlink:href="btr043f2"></graphic>
</fig>
</p>
<p>To evaluate the accuracy of text2genome species and gene mappings, we used as a reference the set of articles where the original authors submitted sequences to GenBank. We chose not to use data from Entrez Gene as a gold standard since it is part of our pipeline, and since Entrez Gene may curate species or genes that are mentioned in an article but for which no sequences are reported. The text2genome-inferred gene or species was considered to be correct if it matched any of the GenBank-derived information for an article. We attempted to filter out articles from this evaluation set that reported either (i) high-throughput sequencing results (e.g. expressed sequence tag projects) by excluding articles with more than 100 submitted sequences, or (ii) genome-scale sequence contigs, by limiting the length of the GenBank sequence to 1 Mb. This resulted in a dataset with the species and the best matching gene for 4800 articles based on GenBank submissions (
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 2</ext-link>
).</p>
<p>As with GenBank submissions, the number of predicted species and genes can vary for text2genome predictions. By limiting the number of text2genome species or gene predictions for a given article, we observed that all performance measures with the exception of species precision decrease with increasing numbers of predictions per article (
<xref ref-type="table" rid="T1">Table 1</xref>
). The most easily interpretable understanding of the true performance of text2genome can be obtained when documents are limited to those with one predicted species/gene and one reference species/gene. In this case, each false positive prediction creates an associated false negative and precision equals recall, and the accuracy of species prediction is 96%, while the accuracy of gene prediction is 91%. When we allow greater than one prediction per article, recall becomes lower than precision, reflecting the fact that not all sequences in a GenBank submission are reported in the full text, confirming the intended role of nucleotide databases as repositories that complement the main publication. At all cutoffs, species predictions are better in terms of precision and recall than gene predictions.
<table-wrap id="T1" position="float">
<label>Table 1.</label>
<caption>
<p>Evaluation of text2genome (t2g) and GNAT species and gene predictions against associated GenBank submissions</p>
</caption>
<table frame="hsides" rules="groups">
<thead align="left">
<tr>
<th rowspan="1" colspan="1">Set</th>
<th rowspan="1" colspan="1">Cutoff</th>
<th rowspan="1" colspan="1">
<italic>N</italic>
</th>
<th rowspan="1" colspan="1">TP</th>
<th rowspan="1" colspan="1">FP</th>
<th rowspan="1" colspan="1">FN</th>
<th rowspan="1" colspan="1">Precision</th>
<th rowspan="1" colspan="1">Recall</th>
</tr>
</thead>
<tbody align="left">
<tr>
<td rowspan="1" colspan="1">t2g species</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">1248</td>
<td rowspan="1" colspan="1">1201</td>
<td rowspan="1" colspan="1">47</td>
<td rowspan="1" colspan="1">47</td>
<td rowspan="1" colspan="1">0.96</td>
<td rowspan="1" colspan="1">0.96</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">1334</td>
<td rowspan="1" colspan="1">1279</td>
<td rowspan="1" colspan="1">55</td>
<td rowspan="1" colspan="1">173</td>
<td rowspan="1" colspan="1">0.96</td>
<td rowspan="1" colspan="1">0.88</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">1338</td>
<td rowspan="1" colspan="1">1283</td>
<td rowspan="1" colspan="1">55</td>
<td rowspan="1" colspan="1">197</td>
<td rowspan="1" colspan="1">0.96</td>
<td rowspan="1" colspan="1">0.87</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">1338</td>
<td rowspan="1" colspan="1">1283</td>
<td rowspan="1" colspan="1">55</td>
<td rowspan="1" colspan="1">197</td>
<td rowspan="1" colspan="1">0.96</td>
<td rowspan="1" colspan="1">0.87</td>
</tr>
<tr>
<td rowspan="1" colspan="1">t2g genes</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">890</td>
<td rowspan="1" colspan="1">814</td>
<td rowspan="1" colspan="1">76</td>
<td rowspan="1" colspan="1">76</td>
<td rowspan="1" colspan="1">0.91</td>
<td rowspan="1" colspan="1">0.91</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">1647</td>
<td rowspan="1" colspan="1">1223</td>
<td rowspan="1" colspan="1">424</td>
<td rowspan="1" colspan="1">457</td>
<td rowspan="1" colspan="1">0.74</td>
<td rowspan="1" colspan="1">0.73</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">1813</td>
<td rowspan="1" colspan="1">1278</td>
<td rowspan="1" colspan="1">535</td>
<td rowspan="1" colspan="1">592</td>
<td rowspan="1" colspan="1">0.70</td>
<td rowspan="1" colspan="1">0.68</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">2017</td>
<td rowspan="1" colspan="1">1325</td>
<td rowspan="1" colspan="1">692</td>
<td rowspan="1" colspan="1">895</td>
<td rowspan="1" colspan="1">0.66</td>
<td rowspan="1" colspan="1">0.60</td>
</tr>
<tr>
<td rowspan="1" colspan="1">GNAT species</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">518</td>
<td rowspan="1" colspan="1">373</td>
<td rowspan="1" colspan="1">145</td>
<td rowspan="1" colspan="1">145</td>
<td rowspan="1" colspan="1">0.72</td>
<td rowspan="1" colspan="1">0.72</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">1793</td>
<td rowspan="1" colspan="1">867</td>
<td rowspan="1" colspan="1">926</td>
<td rowspan="1" colspan="1">301</td>
<td rowspan="1" colspan="1">0.48</td>
<td rowspan="1" colspan="1">0.74</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">1809</td>
<td rowspan="1" colspan="1">875</td>
<td rowspan="1" colspan="1">934</td>
<td rowspan="1" colspan="1">324</td>
<td rowspan="1" colspan="1">0.48</td>
<td rowspan="1" colspan="1">0.73</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">1809</td>
<td rowspan="1" colspan="1">875</td>
<td rowspan="1" colspan="1">934</td>
<td rowspan="1" colspan="1">324</td>
<td rowspan="1" colspan="1">0.48</td>
<td rowspan="1" colspan="1">0.73</td>
</tr>
<tr>
<td rowspan="1" colspan="1">GNAT genes</td>
<td rowspan="1" colspan="1">1</td>
<td rowspan="1" colspan="1">143</td>
<td rowspan="1" colspan="1">73</td>
<td rowspan="1" colspan="1">70</td>
<td rowspan="1" colspan="1">70</td>
<td rowspan="1" colspan="1">0.51</td>
<td rowspan="1" colspan="1">0.51</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">5</td>
<td rowspan="1" colspan="1">1489</td>
<td rowspan="1" colspan="1">261</td>
<td rowspan="1" colspan="1">1228</td>
<td rowspan="1" colspan="1">393</td>
<td rowspan="1" colspan="1">0.18</td>
<td rowspan="1" colspan="1">0.40</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">10</td>
<td rowspan="1" colspan="1">3682</td>
<td rowspan="1" colspan="1">429</td>
<td rowspan="1" colspan="1">3253</td>
<td rowspan="1" colspan="1">724</td>
<td rowspan="1" colspan="1">0.12</td>
<td rowspan="1" colspan="1">0.37</td>
</tr>
<tr>
<td rowspan="1" colspan="1"></td>
<td rowspan="1" colspan="1">100</td>
<td rowspan="1" colspan="1">7591</td>
<td rowspan="1" colspan="1">627</td>
<td rowspan="1" colspan="1">6964</td>
<td rowspan="1" colspan="1">1092</td>
<td rowspan="1" colspan="1">0.08</td>
<td rowspan="1" colspan="1">0.36</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Cutoff refers to the number of predictions allowed for text2genome or GNAT predictions and the GenBank evaluation set.
<italic>N</italic>
refers to the number of predicted species–article or gene–article associations for each method and for the GenBank evaluation set. TP, FP, and FN refer to true positives, false positives and false negatives, respectively. Precision is defined as TP/(TP + FP) and recall is defined as TP/(TP + FN).</p>
</fn>
</table-wrap-foot>
</table-wrap>
</p>
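<p>The precision and recall columns of Table 1 follow directly from the definitions in the table footnote. The short sketch below recomputes them for one row; the function name is an assumption made for the example, and the counts are those of the text2genome gene row at cutoff 5.</p>
<preformat>
```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP + FP) and recall = TP/(TP + FN), as in Table 1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# text2genome genes, cutoff 5: TP = 1223, FP = 424, FN = 457
precision, recall = precision_recall(1223, 424, 457)
print(round(precision, 2), round(recall, 2))  # prints: 0.74 0.73
```
</preformat>
<p>At cutoff 1, FP equals FN in every row, so the two measures coincide, which is why a single accuracy value is quoted for that setting.</p>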
<p>We analyzed the distribution of species and genes in our dataset to provide an overview of the taxonomic and genomic data extracted. As expected, we found that sequences in full-text articles most frequently map to the human and mouse genomes, as well as other organisms used in basic or agricultural genetics (
<xref ref-type="fig" rid="F3">Fig. 3</xref>
). Trends in species identified using sequences in full text are largely consistent with the relative proportion of species in PMC-OA articles with GenBank records and the relative proportion of species mentions that can be extracted by LINNAEUS (
<xref ref-type="bibr" rid="B11">Gerner
<italic>et al.</italic>
, 2010</xref>
). Only 2 of the top 10 species mentioned in the entire set of MEDLINE abstracts (
<xref ref-type="bibr" rid="B11">Gerner
<italic>et al.</italic>
, 2010</xref>
) are missing from the top 10 text2genome best species list: HIV and dog.
<fig id="F3" position="float">
<label>Fig. 3.</label>
<caption>
<p>Top 10 species identified by text2genome. Shown are the proportions of articles matching sequences extracted from the full text (text2genome), with sequence submissions in GenBank, or with mentions of the species name in the full text (LINNAEUS) for species with a sequenced genome in Ensembl/EnsemblGenomes.</p>
</caption>
<graphic xlink:href="btr043f3"></graphic>
</fig>
</p>
<p>In total, text2genome identified 12 655 genes in 9935 articles, the majority of which (6907, 55%) are from the human and mouse genomes. For sequences mapping to human and mouse, the most frequently hit genes are ACTB and GAPDH (
<xref ref-type="fig" rid="F4">Fig. 4</xref>
), which are well-known control loci for RT-PCR experiments.
<fig id="F4" position="float">
<label>Fig. 4.</label>
<caption>
<p>Top 20 genes identified by text2genome. Shown are the number of articles with text2genome hits in PMC-OA for the human and mouse genome.</p>
</caption>
<graphic xlink:href="btr043f4"></graphic>
</fig>
</p>
<p>Other genes with sequences frequently reported in the literature are from heavily investigated loci involved in immunity (TNF, IL6, IFNG) or cancer (P53, MYC). Detected genes are spread over a substantial proportion of the human and mouse genomes. For example, genes detected by text2genome cover 21.1% of the 19 814 human and 13.4% of the 20 192 mouse gene models that are listed in Ensembl 56 and Entrez Gene.</p>
<p>Finally, we compared results of gene–article associations from text2genome with results of the gene name identification and normalization software GNAT (
<xref ref-type="bibr" rid="B14">Hakenberg
<italic>et al.</italic>
, 2008</xref>
;
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">Supplementary File 8</ext-link>
). GNAT detected at least one gene name in 73.2% (112 445/153 513) of PMC-OA articles, a rate approximately 3 times higher than articles with DNA sequences and more than 10 times higher than articles with genes found by text2genome. Likewise, genes identified by GNAT cover approximately three times more of the human (64.9%) and mouse (31.4%) genomes than text2genome. On our GenBank-derived benchmark, GNAT detected 7591 gene names in the full text of 1072 articles. For 50.9% of these articles (546/1072), one of these gene names corresponded to the gene mapped to by a sequence submitted to GenBank. Using the same evaluation criteria as for text2genome, we found that all GNAT performance measures except species recall also decrease with increasing numbers of predictions (
<xref ref-type="table" rid="T1">Table 1</xref>
). In all cases, text2genome outperforms GNAT for species or gene prediction in terms of precision and recall. For documents with only one prediction, the accuracy of GNAT predictions is 72% for species and 51% for genes. For documents with greater than one prediction, GNAT recall is higher than precision, consistent with the higher rate of predictions per article by GNAT relative to text2genome. When predictions for both text2genome and GNAT are directly compared on all of PMC-OA, 50.4% (5192/10 294) of human and 27.2% (1534/5629) mouse gene–article associations inferred by text2genome could be corroborated by GNAT predictions.</p>
</sec>
<sec id="SEC3.3">
<title>3.3 Sequences from articles accelerate the interpretation and design of genomics experiments</title>
<p>To demonstrate the utility of text2genome for research in genetics and genomics, we highlight possible use cases in the following two examples. In addition to these examples, the dataset of sequences extracted from articles by text2genome should also be useful in many other contexts, not least for annotators of various biological databases (
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
, 2008</xref>
;
<xref ref-type="bibr" rid="B30">Wren
<italic>et al.</italic>
, 2005</xref>
).</p>
<sec id="SEC3.3.1">
<title>3.3.1 Example 1: interpreting ChIP-seq data</title>
<p>High-throughput ChIP-seq experiments providing information on the binding of transcription factors to thousands of loci can only be properly interpreted when calibrated against positive control data. By intersecting ChIP-seq regions with genomic regions annotated by text2genome, one can automatically find articles that have previously studied ChIP-seq regions. For example,
<xref ref-type="bibr" rid="B28">Visel
<italic>et al.</italic>
(2009</xref>
) conducted ChIP-seq against p300, a common cofactor in many transcriptional complexes, with the aim of predicting enhancers in the forebrain, midbrain and limb of mouse embryos. We intersected the 5119 p300-bound fragments in this dataset with mouse text2genome genomic regions, and found a region upstream of gene Lmo1 that is bound by p300 in the forebrain and overlaps previously reported primer sequences from
<xref ref-type="bibr" rid="B8">Fulp
<italic>et al.</italic>
(2008</xref>
). These authors have shown that Lmo1 is expressed in the mouse forebrain and, using ChIP-PCR, that its upstream region is bound by the transcription factor ARX in neuroblastoma cell lines. The ChIP-seq fragment covers part of the interval published by
<xref ref-type="bibr" rid="B8">Fulp
<italic>et al.</italic>
(2008</xref>
) and confirms that this p300-bound region is actively bound by a transcription factor in mouse neuronal cells. To enable streamlined automation of this type of analysis using the UCSC Table Browser (
<xref ref-type="bibr" rid="B17">Karolchik
<italic>et al.</italic>
, 2004</xref>
), we provide text2genome data as BED tracks for selected assemblies in the UCSC genome database (
<xref ref-type="fig" rid="F2">Fig. 2</xref>
).</p>
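<p>The intersection performed here with the UCSC Table Browser can also be illustrated offline. The sketch below is a naive pairwise comparison, not the Table Browser implementation: function names, the tuple layout and any coordinates passed to it are hypothetical, and intervals are assumed to be zero-based and half-open as in BED.</p>
<preformat>
```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return min(a_end, b_end) > max(a_start, b_start)


def intersect(peaks, annotations):
    """Pair every ChIP-seq peak with the literature annotations it overlaps.

    Both inputs are lists of (chrom, start, end, name) tuples. The result
    is a list of (peak_name, annotation_name) pairs. The comparison is
    quadratic, which is adequate for a sketch but not for whole genomes.
    """
    hits = []
    for p_chrom, p_start, p_end, p_name in peaks:
        for a_chrom, a_start, a_end, a_name in annotations:
            if p_chrom == a_chrom and overlaps(p_start, p_end, a_start, a_end):
                hits.append((p_name, a_name))
    return hits
```
</preformat>
<p>For genome-scale inputs, sorting both lists by chromosome and start position, or using an interval tree, would avoid the quadratic cost.</p>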
</sec>
<sec id="SEC3.3.2">
<title>3.3.2 Example 2: finding quantitative RT-PCR primers</title>
<p>Biologists using quantitative RT-PCR (
<xref ref-type="bibr" rid="B12">Gibson
<italic>et al.</italic>
, 1996</xref>
) to measure transcript levels need to select control genes and find validated primers and cycling conditions before conducting their experiments. As many of the sequences in our database map to genes that are commonly used in RT-PCR experiments (
<xref ref-type="fig" rid="F4">Fig. 4</xref>
), text2genome-extracted sequences offer a potentially rich source of validated RT-PCR primers. To evaluate this possibility and index articles for their potential utility in RT-PCR, we scanned all PMC-OA articles for keywords related to quantitative RT-PCR (see
<xref ref-type="sec" rid="SEC2">Section 2</xref>
for details). In PMC-OA, 3410 articles have RT-PCR-related keywords in their abstract, 18 912 have RT-PCR keywords in their full text or supplementary files and 1129 articles have text2genome predictions for genes that are commonly used in RT-PCR experiments. The vast majority of text2genome predictions that hit RT-PCR-related genes (81.6%, 922/1129) also have RT-PCR keywords in their full text, demonstrating that text2genome gene predictions can be useful when searching the literature during the design of an RT-PCR experiment. In contrast, only 21% of text2genome predictions that hit RT-PCR-related genes (248/1129) have RT-PCR keywords in their abstracts, indicating that information about putative RT-PCR control genes cannot be obtained readily by searching abstracts alone. To aid in the selection of primers for RT-PCR, the text2genome web site offers a function to limit gene searches to articles that contain RT-PCR keywords in their full text. In this mode, the database currently shows sequences for 9045 genes from 5694 articles from 81 species. Ninety-eight percent of RT-PCR-related sequences are assigned to the human, mouse and rat genomes.</p>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="SEC4">
<title>4 DISCUSSION</title>
<p>In an age of rapidly increasing amounts of DNA sequence data and published literature, finding peer-reviewed experimental results for a sequence of interest is more time consuming than ever. Here, we show that DNA sequences in full-text articles provide a rich source of ‘unique identifiers’ that can be automatically extracted and mapped to genomic data in order to link articles to species, genes and genomic regions. We confirm recent findings that a substantial number of OA articles in PMC contain extractable DNA sequences (
<xref ref-type="bibr" rid="B10">Garcia-Remesal
<italic>et al.</italic>
, 2010b</xref>
), and provide the first quantitative estimate of the proportion of PMC-OA articles with DNA sequences (∼22%), the majority of which we show are short sequences that are not found in GenBank.</p>
<p>Our study is also unique in that it presents the first attempt to apply sequence extraction techniques at a large scale to all types of both full text and
<ext-link ext-link-type="uri" xlink:href="http://bioinformatics.oxfordjournals.org/cgi/content/full/btr043/DC1">supplementary data</ext-link>
files, and in fact may be the first systematic application of text mining to supplementary files in any domain. Our observation that the majority of nucleotides in the PMC-OA corpus were extracted from supplementary files underscores the increasing reliance of authors on these files for depositing important information (
<xref ref-type="bibr" rid="B29">Weiss, 2010</xref>
), as well as the importance of using these resources for biological data mining and requiring ancillary research data to be persistently stored together with the main publication (
<xref ref-type="bibr" rid="B2">Anderson
<italic>et al.</italic>
, 2006</xref>
). Future work will be necessary to determine if the quality of data from full text differs in any way from that obtained in supplementary files.</p>
<p>We find that 96% of species–article and 91% of gene–article associations predicted using text2genome match those based on GenBank submissions from articles discussing a single species or gene. When compared with a state-of-the-art text-mining method that attempts to associate articles to species or genes by named entity recognition, text2genome exhibits much higher performance than GNAT for species (72%) or gene (51%) prediction. Thus, if researchers are looking for genes specifically investigated at the molecular level in an article, our results indicate that DNA sequences in text provide a richer source of information than gene names. It is important to point out that our evaluation of these systems is benchmarked against genes from associated GenBank sequence submissions spanning a wide range of organisms. Since many more genes are mentioned in the literature than are actually studied experimentally and since GNAT only recognizes genes for a limited set of species, the performance of GNAT on our GenBank evaluation set may be reduced relative to benchmarks performed on gene names (
<xref ref-type="bibr" rid="B14">Hakenberg
<italic>et al.</italic>
, 2008</xref>
).</p>
<p>For both text2genome and GNAT, system performance is related to the number of predictions made per paper. The effects of multiple predictions are greater for genes than for species in both systems, and influence precision and recall differently for text2genome and GNAT. The difficulty that both systems have with gene prediction in documents that discuss many genes is consistent with the fact that human annotators do not always agree when asked to curate genes in articles [69–91% agreement depending on the dataset (
<xref ref-type="bibr" rid="B5">Colosimo
<italic>et al.</italic>
, 2005</xref>
;
<xref ref-type="bibr" rid="B22">Morgan
<italic>et al.</italic>
, 2008</xref>
)]. Despite these differences, there is a substantial degree of overlap between GNAT and text2genome gene–article mappings for some species such as human, suggesting that future full-text mining systems could fruitfully integrate sequence extraction together with named entity recognition to predict gene–article associations (
<xref ref-type="bibr" rid="B1">Aerts
<italic>et al.</italic>
, 2008</xref>
).</p>
<p>In addition to providing bidirectional links between articles and genes or species, text2genome allows accessing the biomedical literature using the powerful tools of modern genome browsers. In this manner, text2genome joins a limited number of other hybrid text-mining/genome bioinformatics systems that provide mechanisms to interpret the biomedical literature via genome browsers, such as PosMed (
<xref ref-type="bibr" rid="B31">Yoshida
<italic>et al.</italic>
, 2009</xref>
) and LitTrack (
<ext-link ext-link-type="uri" xlink:href="http://littrack.chop.edu/cgi-bin/hgTracks">http://littrack.chop.edu/cgi-bin/hgTracks</ext-link>
). However, since PosMed and LitTrack rely on gene name recognition methods and therefore can only map articles to the gene level, genomic coordinates must be inferred indirectly by these systems, whether they are appropriate or not. By mapping at the DNA sequence level itself, text2genome can directly identify the exact set of nucleotides in a genome sequence that is analyzed in a study. This distinguishing feature of our system is critical for researchers studying non-genic sequences such as
<italic>cis</italic>
-regulatory regions or miRNA binding sites. Database curators in these and other areas could use our system to aid in the prioritization and extraction of experimental data from papers.</p>
<p>Only ∼1% of all MEDLINE articles are available at the moment for full-text mining in the OA section of PMC. If we were to mine the full text and supplementary files of all 16.5 million articles in MEDLINE from 1970 to the present, we would expect to harvest sequences from ∼3 million articles using the text2genome approach. As we work toward this goal, we hope that the results presented here encourage other free-access and subscription-model publishers to permit the extraction and mapping of DNA sequences within their articles, to the mutual benefit of researchers, database curators and publishers alike.</p>
</sec>
</body>
<back>
<ack>
<title>ACKNOWLEDGEMENTS</title>
<p>We thank Guy Cochrane, Micaela Gallozzi, Nick Gresham, Nicole Rusk, Bruno Strasser, Jonathan Wren and members of the Bergman group for contributing ideas and assistance during the project. We thank Jörg Hakenberg, Jean-Stephane Joly, Sam Griffiths-Jones, Adriaan Klinkenberg, Goran Nenadic, Pedro Olivares-Chauvet and five anonymous referees for helpful comments on the manuscript. This work was performed in part using computational facilities at the Vital-IT (
<ext-link ext-link-type="uri" xlink:href="http://www.vital-it.ch">http://www.vital-it.ch</ext-link>
) Center for high-performance computing of the Swiss Institute of Bioinformatics.</p>
<p>
<italic>Funding:</italic>
European Science Foundation (Frontiers in Functional Genomics travel award to M.H.); the
<funding-source>Biotechnology and Biological Sciences Research Council</funding-source>
(CASE studentship to M.G., ISIS travel award 2076 to C.M.B., grant
<award-id>BB/G000093/1</award-id>
to C.M.B.,
<award-id>BB/E012868/1</award-id>
to C.M.B.); the
<funding-source>European Commission</funding-source>
(grant
<award-id>HEALTH-F4-2008-223210</award-id>
to C.M.B.).</p>
<p>
<italic>Conflict of Interest</italic>
: none declared.</p>
</ack>
<ref-list>
<title>REFERENCES</title>
<ref id="B1">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aerts</surname>
<given-names>S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Text-mining assisted regulatory annotation</article-title>
<source>Genome Biol.</source>
<year>2008</year>
<volume>9</volume>
<fpage>R31</fpage>
<pub-id pub-id-type="pmid">18271954</pub-id>
</element-citation>
</ref>
<ref id="B2">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Anderson</surname>
<given-names>N.R.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>On the persistence of supplementary resources in biomedical publications</article-title>
<source>BMC Bioinformatics</source>
<year>2006</year>
<volume>7</volume>
<fpage>260</fpage>
<pub-id pub-id-type="pmid">16712726</pub-id>
</element-citation>
</ref>
<ref id="B3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benson</surname>
<given-names>D.A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>GenBank</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>D46</fpage>
<lpage>D51</lpage>
<pub-id pub-id-type="pmid">19910366</pub-id>
</element-citation>
</ref>
<ref id="B4">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cock</surname>
<given-names>P.J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Biopython: freely available Python tools for computational molecular biology and bioinformatics</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<fpage>1422</fpage>
<lpage>1423</lpage>
<pub-id pub-id-type="pmid">19304878</pub-id>
</element-citation>
</ref>
<ref id="B5">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Colosimo</surname>
<given-names>M.E.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Data preparation and interannotator agreement: BioCreAtIvE task 1B</article-title>
<source>BMC Bioinformatics</source>
<year>2005</year>
<volume>6</volume>
<issue>Suppl. 1</issue>
<fpage>S12</fpage>
<pub-id pub-id-type="pmid">15960824</pub-id>
</element-citation>
</ref>
<ref id="B6">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dowell</surname>
<given-names>R.D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The distributed annotation system</article-title>
<source>BMC Bioinformatics</source>
<year>2001</year>
<volume>2</volume>
<fpage>7</fpage>
<pub-id pub-id-type="pmid">11667947</pub-id>
</element-citation>
</ref>
<ref id="B7">
<element-citation publication-type="journal">
<collab>The FlyBase Consortium</collab>
<article-title>The FlyBase database of the Drosophila genome projects and community literature</article-title>
<source>Nucleic Acids Res.</source>
<year>2003</year>
<volume>31</volume>
<fpage>172</fpage>
<lpage>175</lpage>
<pub-id pub-id-type="pmid">12519974</pub-id>
</element-citation>
</ref>
<ref id="B8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fulp</surname>
<given-names>C.T.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Identification of Arx transcriptional targets in the developing basal forebrain</article-title>
<source>Hum. Mol. Genet.</source>
<year>2008</year>
<volume>17</volume>
<fpage>3740</fpage>
<lpage>3760</lpage>
<pub-id pub-id-type="pmid">18799476</pub-id>
</element-citation>
</ref>
<ref id="B9">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Garcia-Remesal</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A method for automatically extracting infectious disease-related primers and probes from the literature</article-title>
<source>BMC Bioinformatics</source>
<year>2010a</year>
<volume>11</volume>
<fpage>410</fpage>
<pub-id pub-id-type="pmid">20682041</pub-id>
</element-citation>
</ref>
<ref id="B10">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Garcia-Remesal</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>PubDNA Finder: a web database linking full-text articles to sequences of nucleic acids</article-title>
<source>Bioinformatics</source>
<year>2010b</year>
<volume>26</volume>
<fpage>2801</fpage>
<lpage>2802</lpage>
<pub-id pub-id-type="pmid">20829445</pub-id>
</element-citation>
</ref>
<ref id="B11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gerner</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>LINNAEUS: a species name identification system for biomedical literature</article-title>
<source>BMC Bioinformatics</source>
<year>2010</year>
<volume>11</volume>
<fpage>85</fpage>
<pub-id pub-id-type="pmid">20149233</pub-id>
</element-citation>
</ref>
<ref id="B12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gibson</surname>
<given-names>U.E.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A novel method for real time quantitative RT-PCR</article-title>
<source>Genome Res.</source>
<year>1996</year>
<volume>6</volume>
<fpage>995</fpage>
<lpage>1001</lpage>
<pub-id pub-id-type="pmid">8908519</pub-id>
</element-citation>
</ref>
<ref id="B13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gray</surname>
<given-names>P.W.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The murine tumor necrosis factor-beta (lymphotoxin) gene sequence</article-title>
<source>Nucleic Acids Res.</source>
<year>1987</year>
<volume>15</volume>
<fpage>3937</fpage>
<pub-id pub-id-type="pmid">3588316</pub-id>
</element-citation>
</ref>
<ref id="B14">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hakenberg</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Inter-species normalization of gene mentions with GNAT</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>i126</fpage>
<lpage>i132</lpage>
<pub-id pub-id-type="pmid">18689813</pub-id>
</element-citation>
</ref>
<ref id="B15">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Holley</surname>
<given-names>R.W.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Structure of a ribonucleic acid</article-title>
<source>Science</source>
<year>1965</year>
<volume>147</volume>
<fpage>1462</fpage>
<lpage>1465</lpage>
<pub-id pub-id-type="pmid">14263761</pub-id>
</element-citation>
</ref>
<ref id="B16">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hubbard</surname>
<given-names>T.J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ensembl 2009</article-title>
<source>Nucleic Acids Res.</source>
<year>2009</year>
<volume>37</volume>
<fpage>D690</fpage>
<lpage>D697</lpage>
<pub-id pub-id-type="pmid">19033362</pub-id>
</element-citation>
</ref>
<ref id="B17">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karolchik</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The UCSC Table Browser data retrieval tool</article-title>
<source>Nucleic Acids Res.</source>
<year>2004</year>
<volume>32</volume>
<fpage>D493</fpage>
<lpage>D496</lpage>
<pub-id pub-id-type="pmid">14681465</pub-id>
</element-citation>
</ref>
<ref id="B18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kent</surname>
<given-names>W.J.</given-names>
</name>
</person-group>
<article-title>BLAT–the BLAST-like alignment tool</article-title>
<source>Genome Res.</source>
<year>2002</year>
<volume>12</volume>
<fpage>656</fpage>
<lpage>664</lpage>
<pub-id pub-id-type="pmid">11932250</pub-id>
</element-citation>
</ref>
<ref id="B19">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kersey</surname>
<given-names>P.J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Ensembl genomes: extending Ensembl across the taxonomic space</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>D563</fpage>
<lpage>D569</lpage>
<pub-id pub-id-type="pmid">19884133</pub-id>
</element-citation>
</ref>
<ref id="B20">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krallinger</surname>
<given-names>M.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Linking genes to literature: text mining, information extraction, and retrieval applications for biology</article-title>
<source>Genome Biol.</source>
<year>2008</year>
<volume>9</volume>
<issue>Suppl. 2</issue>
<fpage>S8</fpage>
<pub-id pub-id-type="pmid">18834499</pub-id>
</element-citation>
</ref>
<ref id="B21">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maglott</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Entrez Gene: gene-centered information at NCBI</article-title>
<source>Nucleic Acids Res.</source>
<year>2007</year>
<volume>35</volume>
<fpage>D26</fpage>
<lpage>D31</lpage>
<pub-id pub-id-type="pmid">17148475</pub-id>
</element-citation>
</ref>
<ref id="B22">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morgan</surname>
<given-names>A.A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Overview of BioCreative II gene normalization</article-title>
<source>Genome Biol.</source>
<year>2008</year>
<volume>9</volume>
<issue>Suppl. 2</issue>
<fpage>S3</fpage>
<pub-id pub-id-type="pmid">18834494</pub-id>
</element-citation>
</ref>
<ref id="B23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rhead</surname>
<given-names>B.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The UCSC genome browser database: update 2010</article-title>
<source>Nucleic Acids Res.</source>
<year>2010</year>
<volume>38</volume>
<fpage>D613</fpage>
<lpage>D619</lpage>
<pub-id pub-id-type="pmid">19906737</pub-id>
</element-citation>
</ref>
<ref id="B24">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roberts</surname>
<given-names>R.J.</given-names>
</name>
</person-group>
<article-title>PubMed Central: the GenBank of the published literature</article-title>
<source>Proc. Natl Acad. Sci. USA</source>
<year>2001</year>
<volume>98</volume>
<fpage>381</fpage>
<lpage>382</lpage>
<pub-id pub-id-type="pmid">11209037</pub-id>
</element-citation>
</ref>
<ref id="B25">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Semon</surname>
<given-names>D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Nucleotide sequence of the murine TNF locus, including the TNF-alpha (tumor necrosis factor) and TNF-beta (lymphotoxin) genes</article-title>
<source>Nucleic Acids Res.</source>
<year>1987</year>
<volume>15</volume>
<fpage>9083</fpage>
<lpage>9084</lpage>
<pub-id pub-id-type="pmid">3684584</pub-id>
</element-citation>
</ref>
<ref id="B26">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shtatland</surname>
<given-names>T.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>PepBank - a database of peptides based on sequence text mining and public peptide data sources</article-title>
<source>BMC Bioinformatics</source>
<year>2007</year>
<volume>8</volume>
<fpage>280</fpage>
<pub-id pub-id-type="pmid">17678535</pub-id>
</element-citation>
</ref>
<ref id="B27">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vandesompele</surname>
<given-names>J.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes</article-title>
<source>Genome Biol.</source>
<year>2002</year>
<volume>3</volume>
<comment>RESEARCH0034</comment>
</element-citation>
</ref>
<ref id="B28">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Visel</surname>
<given-names>A.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ChIP-seq accurately predicts tissue-specific activity of enhancers</article-title>
<source>Nature</source>
<year>2009</year>
<volume>457</volume>
<fpage>854</fpage>
<lpage>858</lpage>
<pub-id pub-id-type="pmid">19212405</pub-id>
</element-citation>
</ref>
<ref id="B29">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Weiss</surname>
<given-names>M.S.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Citations in supplementary material</article-title>
<source>Acta Cryst.</source>
<year>2010</year>
<volume>D66</volume>
<fpage>1269</fpage>
<lpage>1270</lpage>
</element-citation>
</ref>
<ref id="B30">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wren</surname>
<given-names>J.D.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Markov model recognition and classification of DNA/protein sequences within large text databases</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>4046</fpage>
<lpage>4053</lpage>
<pub-id pub-id-type="pmid">16159926</pub-id>
</element-citation>
</ref>
<ref id="B31">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yoshida</surname>
<given-names>Y.</given-names>
</name>
<etal></etal>
</person-group>
<article-title>PosMed (Positional Medline): prioritizing genes with an artificial neural network comprising medical documents to accelerate positional cloning</article-title>
<source>Nucleic Acids Res.</source>
<year>2009</year>
<volume>37</volume>
<fpage>W147</fpage>
<lpage>W152</lpage>
<pub-id pub-id-type="pmid">19468046</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000157 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000157 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:3065681
   |texte=   Annotating genes and genomes with DNA sequences extracted from biomedical articles
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:21325301" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024