Cyberinfrastructure Exploration Server

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study</title>
<author>
<name sortKey="Read, Kevin B" sort="Read, Kevin B" uniqKey="Read K" first="Kevin B." last="Read">Kevin B. Read</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>Medical Library, NYU Langone Medical Center, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sheehan, Jerry R" sort="Sheehan, Jerry R" uniqKey="Sheehan J" first="Jerry R." last="Sheehan">Jerry R. Sheehan</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huerta, Michael F" sort="Huerta, Michael F" uniqKey="Huerta M" first="Michael F." last="Huerta">Michael F. Huerta</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Knecht, Lou S" sort="Knecht, Lou S" uniqKey="Knecht L" first="Lou S." last="Knecht">Lou S. Knecht</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mork, James G" sort="Mork, James G" uniqKey="Mork J" first="James G." last="Mork">James G. Mork</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Humphreys, Betsy L" sort="Humphreys, Betsy L" uniqKey="Humphreys B" first="Betsy L." last="Humphreys">Betsy L. Humphreys</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26207759</idno>
<idno type="pmc">4514623</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514623</idno>
<idno type="RBID">PMC:4514623</idno>
<idno type="doi">10.1371/journal.pone.0132735</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000076</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study</title>
<author>
<name sortKey="Read, Kevin B" sort="Read, Kevin B" uniqKey="Read K" first="Kevin B." last="Read">Kevin B. Read</name>
<affiliation>
<nlm:aff id="aff001">
<addr-line>Medical Library, NYU Langone Medical Center, New York, New York, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sheehan, Jerry R" sort="Sheehan, Jerry R" uniqKey="Sheehan J" first="Jerry R." last="Sheehan">Jerry R. Sheehan</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huerta, Michael F" sort="Huerta, Michael F" uniqKey="Huerta M" first="Michael F." last="Huerta">Michael F. Huerta</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Knecht, Lou S" sort="Knecht, Lou S" uniqKey="Knecht L" first="Lou S." last="Knecht">Lou S. Knecht</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Mork, James G" sort="Mork, James G" uniqKey="Mork J" first="James G." last="Mork">James G. Mork</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Humphreys, Betsy L" sort="Humphreys, Betsy L" uniqKey="Humphreys B" first="Betsy L." last="Humphreys">Betsy L. Humphreys</name>
<affiliation>
<nlm:aff id="aff002">
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec id="sec001">
<title>Objective</title>
<p>This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible,” that is, not deposited in a known repository.</p>
</sec>
<sec id="sec002">
<title>Methods</title>
<p>We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed, and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article.</p>
</sec>
<sec id="sec003">
<title>Results</title>
<p>About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% whose datasets are invisible. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects.</p>
</sec>
<sec id="sec004">
<title>Conclusion</title>
<p>In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Chan, Aw" uniqKey="Chan A">AW Chan</name>
</author>
<author>
<name sortKey="Song, F" uniqKey="Song F">F Song</name>
</author>
<author>
<name sortKey="Vickers, A" uniqKey="Vickers A">A Vickers</name>
</author>
<author>
<name sortKey="Jefferson, T" uniqKey="Jefferson T">T Jefferson</name>
</author>
<author>
<name sortKey="Dickersin, K" uniqKey="Dickersin K">K Dickersin</name>
</author>
<author>
<name sortKey="G Tzsche, Pc" uniqKey="G Tzsche P">PC Gøtzsche</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Neveol, A" uniqKey="Neveol A">A Névéol</name>
</author>
<author>
<name sortKey="Wilbur, Wj" uniqKey="Wilbur W">WJ Wilbur</name>
</author>
<author>
<name sortKey="Lu, Z" uniqKey="Lu Z">Z Lu</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Margolis, R" uniqKey="Margolis R">R Margolis</name>
</author>
<author>
<name sortKey="Derr, L" uniqKey="Derr L">L Derr</name>
</author>
<author>
<name sortKey="Dunn, M" uniqKey="Dunn M">M Dunn</name>
</author>
<author>
<name sortKey="Huerta, M" uniqKey="Huerta M">M Huerta</name>
</author>
<author>
<name sortKey="Larkin, J" uniqKey="Larkin J">J Larkin</name>
</author>
<author>
<name sortKey="Sheehan, J" uniqKey="Sheehan J">J Sheehan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alsheikh Ali, Aa" uniqKey="Alsheikh Ali A">AA Alsheikh-Ali</name>
</author>
<author>
<name sortKey="Qureshi, W" uniqKey="Qureshi W">W Qureshi</name>
</author>
<author>
<name sortKey="Al Mallah, Mh" uniqKey="Al Mallah M">MH Al-Mallah</name>
</author>
<author>
<name sortKey="Ioannidis, Jp" uniqKey="Ioannidis J">JP Ioannidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mooney, H" uniqKey="Mooney H">H Mooney</name>
</author>
<author>
<name sortKey="Newton, Mp" uniqKey="Newton M">MP Newton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Belter, Cw" uniqKey="Belter C">CW Belter</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Piwowar, Ha" uniqKey="Piwowar H">HA Piwowar</name>
</author>
<author>
<name sortKey="Carlson, D" uniqKey="Carlson D">D Carlson</name>
</author>
<author>
<name sortKey="Vision, Tj" uniqKey="Vision T">TJ Vision</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ari O, A" uniqKey="Ari O A">A. Ariño</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ross, Js" uniqKey="Ross J">JS Ross</name>
</author>
<author>
<name sortKey="Tse, T" uniqKey="Tse T">T Tse</name>
</author>
<author>
<name sortKey="Zarin, Da" uniqKey="Zarin D">DA Zarin</name>
</author>
<author>
<name sortKey="Xu, H" uniqKey="Xu H">H Xu</name>
</author>
<author>
<name sortKey="Zhou, L" uniqKey="Zhou L">L Zhou</name>
</author>
<author>
<name sortKey="Krumholz, Hm" uniqKey="Krumholz H">HM Krumholz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vines, Th" uniqKey="Vines T">TH Vines</name>
</author>
<author>
<name sortKey="Albert, Ay" uniqKey="Albert A">AY Albert</name>
</author>
<author>
<name sortKey="Andrew, Rl" uniqKey="Andrew R">RL Andrew</name>
</author>
<author>
<name sortKey="Debarre, F" uniqKey="Debarre F">F Débarre</name>
</author>
<author>
<name sortKey="Bock, Dg" uniqKey="Bock D">DG Bock</name>
</author>
<author>
<name sortKey="Franklin, Mt" uniqKey="Franklin M">MT Franklin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hinchliff, Ce" uniqKey="Hinchliff C">CE Hinchliff</name>
</author>
<author>
<name sortKey="Smith, Sa" uniqKey="Smith S">SA Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Robinson Garcia, N" uniqKey="Robinson Garcia N">N Robinson-Garcia</name>
</author>
<author>
<name sortKey="Jimenez Contreras, E" uniqKey="Jimenez Contreras E">E Jimenez-Contreras</name>
</author>
<author>
<name sortKey="Torres Salinas, D" uniqKey="Torres Salinas D">D Torres-Salinas</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Parson, Ma" uniqKey="Parson M">MA Parson</name>
</author>
<author>
<name sortKey="Duerr, R" uniqKey="Duerr R">R Duerr</name>
</author>
<author>
<name sortKey="Minster, Jb" uniqKey="Minster J">JB Minster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Callaghan, S" uniqKey="Callaghan S">S Callaghan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lynch, C" uniqKey="Lynch C">C Lynch</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lindberg, Da" uniqKey="Lindberg D">DA Lindberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Thoma, Gr" uniqKey="Thoma G">GR Thoma</name>
</author>
<author>
<name sortKey="Ford, G" uniqKey="Ford G">G Ford</name>
</author>
<author>
<name sortKey="Antani, S" uniqKey="Antani S">S Antani</name>
</author>
<author>
<name sortKey="Demner Fushman, D" uniqKey="Demner Fushman D">D Demner-Fushman</name>
</author>
<author>
<name sortKey="Chung, M" uniqKey="Chung M">M Chung</name>
</author>
<author>
<name sortKey="Simpson, M" uniqKey="Simpson M">M Simpson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mons, B" uniqKey="Mons B">B Mons</name>
</author>
<author>
<name sortKey="Van Haagen, H" uniqKey="Van Haagen H">H van Haagen</name>
</author>
<author>
<name sortKey="Chichester, C" uniqKey="Chichester C">C Chichester</name>
</author>
<author>
<name sortKey="Hoen, Pb" uniqKey="Hoen P">PB Hoen</name>
</author>
<author>
<name sortKey="Den Dunnen, Jt" uniqKey="Den Dunnen J">JT den Dunnen</name>
</author>
<author>
<name sortKey="Van Ommen, G" uniqKey="Van Ommen G">G van Ommen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chavan, V" uniqKey="Chavan V">V Chavan</name>
</author>
<author>
<name sortKey="Penev, L" uniqKey="Penev L">L Penev</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Costello, Mj" uniqKey="Costello M">MJ Costello</name>
</author>
<author>
<name sortKey="Michener, Wk" uniqKey="Michener W">WK Michener</name>
</author>
<author>
<name sortKey="Gahegan, M" uniqKey="Gahegan M">M Gahegan</name>
</author>
<author>
<name sortKey="Zhang, Zq" uniqKey="Zhang Z">ZQ Zhang</name>
</author>
<author>
<name sortKey="Bourne, Pe" uniqKey="Bourne P">PE Bourne</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rousidis, D" uniqKey="Rousidis D">D Rousidis</name>
</author>
<author>
<name sortKey="Garoufallou, E" uniqKey="Garoufallou E">E Garoufallou</name>
</author>
<author>
<name sortKey="Balatsoukas, P" uniqKey="Balatsoukas P">P Balatsoukas</name>
</author>
<author>
<name sortKey="Sicilia, Ma" uniqKey="Sicilia M">MA Sicilia</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26207759</article-id>
<article-id pub-id-type="pmc">4514623</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0132735</article-id>
<article-id pub-id-type="publisher-id">PONE-D-15-00963</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study</article-title>
<alt-title alt-title-type="running-head">Improving Discovery and Access to NIH-Funded Data: A Preliminary Study</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Read</surname>
<given-names>Kevin B.</given-names>
</name>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
<xref rid="cor001" ref-type="corresp">*</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Sheehan</surname>
<given-names>Jerry R.</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Huerta</surname>
<given-names>Michael F.</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Knecht</surname>
<given-names>Lou S.</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Mork</surname>
<given-names>James G.</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name>
<surname>Humphreys</surname>
<given-names>Betsy L.</given-names>
</name>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<collab>NIH Big Data Annotator Group</collab>
<xref ref-type="aff" rid="aff003">
<sup>3</sup>
</xref>
<xref ref-type="author-notes" rid="fn001">
<sup>¶</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>Medical Library, NYU Langone Medical Center, New York, New York, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</aff>
<aff id="aff003">
<label>3</label>
<addr-line>National Institutes of Health, Bethesda, Maryland, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Larivière</surname>
<given-names>Vincent</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>Université de Montréal, CANADA</addr-line>
</aff>
<author-notes>
<fn fn-type="conflict" id="coi001">
<p>
<bold>Competing Interests: </bold>
The authors have declared that no competing interests exist.</p>
</fn>
<fn fn-type="con" id="contrib001">
<p>Conceived and designed the experiments: KBR JRS MFH LSK JGM BLH. Performed the experiments: KBR JRS MFH LSK JGM BLH NBDAG. Analyzed the data: KBR JRS MFH LSK JGM BLH. Wrote the paper: KBR JRS MFH LSK JGM BLH.</p>
</fn>
<fn fn-type="other" id="fn001">
<p>¶ Membership of the NIH Big Data Annotator Group is listed in the Acknowledgments.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>kevin.read@nyumc.org</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>7</month>
<year>2015</year>
</pub-date>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<volume>10</volume>
<issue>7</issue>
<elocation-id>e0132735</elocation-id>
<history>
<date date-type="received">
<day>8</day>
<month>1</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>6</month>
<year>2015</year>
</date>
</history>
<permissions>
<license xlink:href="https://creativecommons.org/publicdomain/zero/1.0/">
<license-p>This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the
<ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/publicdomain/zero/1.0/">Creative Commons CC0</ext-link>
public domain dedication</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:type="simple" xlink:href="pone.0132735.pdf"></self-uri>
<abstract>
<sec id="sec001">
<title>Objective</title>
<p>This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are “invisible,” that is, not deposited in a known repository.</p>
</sec>
<sec id="sec002">
<title>Methods</title>
<p>We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed, and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article.</p>
</sec>
<sec id="sec003">
<title>Results</title>
<p>About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% whose datasets are invisible. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects.</p>
</sec>
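<p>The arithmetic behind the Results figures can be checked with a short Python sketch. It assumes the invisible-dataset total is simply the per-article average multiplied by the number of articles with invisible datasets; the function name and this model are illustrative assumptions, not the authors' procedure, and the article counts it prints are implied values rather than figures reported in the abstract.</p>

```python
# Figures quoted in the Results paragraph of the abstract.
AVG_DATASETS_PER_ARTICLE = (2.9, 3.4)    # low and high annotator averages
INVISIBLE_DATASETS = (200_000, 235_000)  # low and high estimated totals
SHARE_WITHOUT_DEPOSIT = 0.88             # articles not citing a known repository


def implied_article_count(total_datasets, avg_per_article):
    """Articles implied by total = articles * average (an assumed model)."""
    return total_datasets / avg_per_article


# Pair the low total with the low average and the high with the high.
implied_articles = [
    implied_article_count(total, avg)
    for total, avg in zip(INVISIBLE_DATASETS, AVG_DATASETS_PER_ARTICLE)
]

# Scale up by the 88% share to get the implied size of the analyzed article set.
implied_analyzed_set = [n / SHARE_WITHOUT_DEPOSIT for n in implied_articles]

for n, m in zip(implied_articles, implied_analyzed_set):
    print(f"articles with invisible datasets: {n:,.0f}; analyzed set: {m:,.0f}")
```

<p>Both ends of the range imply roughly 69,000 articles with invisible datasets, which is consistent with the two totals being the two averages applied to one article count.</p>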
<sec id="sec004">
<title>Conclusion</title>
<p>In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a “dataset,” determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets.</p>
</sec>
</abstract>
<funding-group>
<funding-statement>This research was supported by the Intramural Research Program of the U.S. National Institutes of Health, National Library of Medicine (NLM) and in part by an appointment to the NLM Associate Fellowship Program sponsored by the National Library of Medicine and administered by the Oak Ridge Institute for Science and Education.</funding-statement>
</funding-group>
<counts>
<fig-count count="5"></fig-count>
<table-count count="8"></table-count>
<page-count count="18"></page-count>
</counts>
<custom-meta-group>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>The data analysis file and all annotator data files are available in the Figshare repository /m9.figshare.1285515. Read K. (2015). Sizing the Problem of Improving Discovery and Access to NIH-funded Data: A Preliminary Study (Datasets). Figshare. Available:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.6084/m9.figshare.1285515">http://dx.doi.org/10.6084/m9.figshare.1285515</ext-link>
.</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>The data analysis file and all annotator data files are available in the Figshare repository /m9.figshare.1285515. Read K. (2015). Sizing the Problem of Improving Discovery and Access to NIH-funded Data: A Preliminary Study (Datasets). Figshare. Available:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.6084/m9.figshare.1285515">http://dx.doi.org/10.6084/m9.figshare.1285515</ext-link>
.</p>
</notes>
</front>
<body>
<sec sec-type="intro" id="sec005">
<title>Introduction</title>
<p>Biomedical research is becoming increasingly data-centric. The proliferation of low-cost methods for whole genome sequencing, growing use of functional magnetic resonance imaging (fMRI) and other imaging modalities, and more widespread availability of clinical data in electronic health records (EHRs) are among the factors enabling biomedical researchers to generate and make use of increasing volumes of digital data in their research. Growth in the availability of biomedical data is, in turn, generating growing interest in improving the management and utilization of the many types of data (e.g., genomic, imaging, behavioral, clinical, exposure) that are used in biomedical research.</p>
<p>Improved management of biomedical research data—or any scientific data—can have many benefits. Fundamentally, improved management of scientific data is essential to the preservation of the scientific record, of which data are a growing part. It is also the basis for improved sharing of data, for example, enabling other researchers to have access to previously collected data. Data sharing can improve the quality and efficiency of research by allowing researchers to verify and validate prior research findings, to conduct research that combines previously collected data with newly collected data, and to compare the results of related research studies more easily [
<xref rid="pone.0132735.ref001" ref-type="bibr">1</xref>
,
<xref rid="pone.0132735.ref002" ref-type="bibr">2</xref>
].</p>
<p>Around the world, governments, research funding organizations, and investigators are actively pursuing better management of, and access to, scientific research data [
<xref rid="pone.0132735.ref003" ref-type="bibr">3</xref>
]. The European Commission, European Research Council, and Canadian Institutes of Health Research have all established policies for research data [
<xref rid="pone.0132735.ref004" ref-type="bibr">4</xref>
–
<xref rid="pone.0132735.ref006" ref-type="bibr">6</xref>
]. In the United States, a February 2013 memorandum from the White House Office of Science and Technology Policy directed all U.S. federal science agencies that spend more than $100 million per year on research and development to develop plans to increase public access to digital data resulting from research funded by those agencies [
<xref rid="pone.0132735.ref007" ref-type="bibr">7</xref>
]. The U.S. Department of Health and Human Services issued its plans in February 2015 [
<xref rid="pone.0132735.ref008" ref-type="bibr">8</xref>
].</p>
<p>The U.S. National Institutes of Health (NIH), part of the Department of Health and Human Services, is the world’s largest funder of biomedical research. It invests approximately $30 billion per year in biomedical research, most of which is expended through competitive grants to more than 300,000 researchers at universities, medical schools, and other research institutions in every U.S. state and around the world [
<xref rid="pone.0132735.ref009" ref-type="bibr">9</xref>
]. The NIH Big Data to Knowledge (BD2K) initiative, launched in 2013, aims to “enable biomedical scientists to capitalize more fully on the Big Data being generated by those research communities” [
<xref rid="pone.0132735.ref010" ref-type="bibr">10</xref>
]. One goal of BD2K is to develop effective and efficient mechanisms to enable the identification of, access to, and citation for biomedical data, bringing more data into the ecosystem of science and scholarship [
<xref rid="pone.0132735.ref011" ref-type="bibr">11</xref>
].</p>
<p>An important step in designing, developing, and implementing mechanisms to discover, access, and cite the biomedical data used in NIH-funded research is to characterize the number of new datasets generated annually by NIH-funded researchers, the types of data created, and the frequency of reuse of existing data. Of particular interest are “invisible” datasets—datasets that are not currently stored and made accessible via well-known, publicly accessible data repositories. Previous studies have estimated how much data is shared by analyzing a set number of journals [
<xref rid="pone.0132735.ref012" ref-type="bibr">12</xref>
,
<xref rid="pone.0132735.ref013" ref-type="bibr">13</xref>
], performed analyses on how often specific datasets were cited in the literature [
<xref rid="pone.0132735.ref014" ref-type="bibr">14</xref>
,
<xref rid="pone.0132735.ref015" ref-type="bibr">15</xref>
], and used complex algorithms to estimate the entire universe of data for a specific discipline [
<xref rid="pone.0132735.ref016" ref-type="bibr">16</xref>
]. While it has been shown that it is possible to make some estimates of the types of data that are currently deposited in known repositories, it is more challenging to estimate the number of datasets that are
<italic>not</italic>
publicly or systematically registered, deposited, or archived. Arguably, such datasets should be a primary focus of any effort to improve the discoverability and reuse of data because they are less discoverable and accessible than data deposited in a known repository.</p>
<p>We conducted a study to develop a preliminary estimate of the annual volume and types of datasets generated by NIH-funded researchers. This study was undertaken to inform initial NIH efforts to improve the discoverability of and access to biomedical datasets. For the purpose of this study, a dataset was defined as any collection of data (e.g., a different type of measurement) that was generated or reused to inform the results described in an article.</p>
</sec>
<sec sec-type="materials|methods" id="sec006">
<title>Methods</title>
<p>Our approach to characterizing biomedical research datasets relied on an examination of datasets that are used or generated in the course of research that is reported in published journal literature. This approach misses datasets that are collected as part of a research project but are not reported in a publication. While little is known about the full extent of non-publication in biomedical research, recent work indicates that as long as four years after study completion, the results from approximately one-third of clinical trials registered in ClinicalTrials.gov remain unpublished [
<xref rid="pone.0132735.ref017" ref-type="bibr">17</xref>
]. There is also evidence that the availability and discoverability of research datasets decline rapidly with age [
<xref rid="pone.0132735.ref018" ref-type="bibr">18</xref>
]. To the extent that discovery of datasets may be enabled by linking data to associated journal articles, our approach was a reasonable first step toward quantification and characterization. We further restricted our analysis to datasets generated by NIH-funded research. While this does not represent all of biomedical research, it is research that is subject to U.S. policies that require expanded data sharing. This sample also covers a broad spectrum of biomedical research types, from basic to clinical research across a wide range of diseases, conditions, and systems, and therefore was a good starting point for analysis. For clarity, the process used to identify NIH-funded datasets via the published journal literature, described below, is also illustrated in
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
.</p>
<fig id="pone.0132735.g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.g001</object-id>
<label>Fig 1</label>
<caption>
<title>Diagram of process taken to identify NIH-funded datasets via the published journal literature (Including Results).</title>
</caption>
<graphic xlink:href="pone.0132735.g001"></graphic>
</fig>
<sec id="sec007">
<title>Identifying articles with datasets deposited in known repositories</title>
<p>To estimate the number of datasets generated annually from NIH-funded research and identify datasets that are not stored in a known repository, our analysis focused on NIH-funded articles that were published in 2011. These articles represented the most current complete set of articles for a given year at the time of our study. To retrieve these articles, we searched PubMed using the strategy illustrated below (
<xref rid="pone.0132735.t001" ref-type="table">Table 1-a</xref>
), which retrieved citations from articles published in 2011 that acknowledged research funding support from NIH (step 1a,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Use of PubMed’s Publication Type [PT] and Grant [GR] search tags enabled us to focus the search on citations for articles that acknowledged NIH funding. This search identified almost 120,000 citations. We further limited the search to citations indexed in MEDLINE (using the MEDLINE [sb] subset search tag). This step focused the analysis on citations that have been fully indexed and contain additional information indicating whether datasets used in the reported research were deposited in a known data repository. This search (
<xref rid="pone.0132735.t001" ref-type="table">Table 1-b</xref>
) retrieved more than 113,000 articles. Bolded text in Tables
<xref rid="pone.0132735.t001" ref-type="table">1</xref>–
<xref rid="pone.0132735.t005" ref-type="table">5</xref>
indicates search terms that were progressively added to the search string (step 1b,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<table-wrap id="pone.0132735.t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t001</object-id>
<label>Table 1</label>
<caption>
<title>PubMed searches identifying articles with funding support from the NIH.</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t001g" xlink:href="pone.0132735.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">a)</td>
<td align="left" rowspan="1" colspan="1">2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt])</td>
<td align="left" rowspan="1" colspan="1">119,415</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">b)</td>
<td align="left" rowspan="1" colspan="1">2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt])
<bold>AND medline [sb]</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<underline>113,089</underline>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0132735.t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t002</object-id>
<label>Table 2</label>
<caption>
<title>PubMed searches identifying when datasets were deposited in certain repositories (SI dataset).</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t002g" xlink:href="pone.0132735.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb]
<bold>AND (GDB [si] OR GENBANK [si] OR OMIM [si] OR PDB [si] OR PIR [si] OR RefSeq [si] OR SWISSPROT [si] OR ClinicalTrials.gov [si] OR ISRCTN [si] OR GEO [si] OR PubChem-Substance [si] OR PubChem-Compound [si] OR PubChem-BioAssay [si])</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<underline>3528</underline>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0132735.t003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t003</object-id>
<label>Table 3</label>
<caption>
<title>PubMed search identifying articles with the “Molecular Sequence Data” MeSH Heading (MSD dataset).</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t003g" xlink:href="pone.0132735.t003"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb]
<bold>NOT</bold>
(GDB [si] OR GENBANK [si] OR OMIM [si] OR PDB [si] OR PIR [si] OR RefSeq [si] OR SWISSPROT [si] OR ClinicalTrials.gov [si] OR ISRCTN [si] OR GEO [si] OR PubChem-Substance [si] OR PubChem-Compound [si] OR PubChem-BioAssay [si])
<bold>AND molecular sequence data [mh:noexp]</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<underline>3460</underline>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0132735.t004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t004</object-id>
<label>Table 4</label>
<caption>
<title>PubMed search identifying articles in PMC.</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t004g" xlink:href="pone.0132735.t004"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb]
<bold>NOT</bold>
(GDB [si] OR GENBANK [si] OR OMIM [si] OR PDB [si] OR PIR [si] OR RefSeq [si] OR SWISSPROT [si] OR ClinicalTrials.gov [si] OR ISRCTN [si] OR GEO [si] OR PubChem-Substance [si] OR PubChem-Compound [si] OR PubChem-BioAssay [si])
<bold>NOT</bold>
molecular sequence data [mh:noexp]
<bold>AND pubmed pmc all[sb]</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<underline>81604</underline>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<table-wrap id="pone.0132735.t005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t005</object-id>
<label>Table 5</label>
<caption>
<title>Removal of articles that were not considered “research”.</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t005g" xlink:href="pone.0132735.t005"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] OR Research Support, N.I.H., Intramural [pt]) AND medline [sb] NOT (GDB [si] OR GENBANK [si] OR OMIM [si] OR PDB [si] OR PIR [si] OR RefSeq [si] OR SWISSPROT [si] OR ClinicalTrials.gov [si] OR ISRCTN [si] OR GEO [si] OR PubChem-Substance [si] OR PubChem-Compound [si] OR PubChem-BioAssay [si]) NOT molecular sequence data [mh:noexp] AND pubmed pmc all[sb]
<bold>NOT review [pt] NOT letter [pt] NOT news [pt] NOT editorial [pt]</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<underline>71910</underline>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
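<p>The progressive refinement shown in Tables 1 to 5 can be sketched in code. The following is a minimal illustration, not the authors' tooling: it only assembles the search strings from the terms printed in the tables, and the constant and function names are our own. Submitting the strings to PubMed today would not necessarily reproduce the reported counts, because indexing has continued since the study.</p>

```python
# Sketch: rebuild the search strings of Tables 1-5 by progressive refinement.
# Every search term is copied from the tables; the helper names are ours.

NIH_2011 = (
    "2011 [dp] AND (NIH [gr] OR Research Support, N.I.H., Extramural [pt] "
    "OR Research Support, N.I.H., Intramural [pt])"
)

SI_REPOSITORIES = [
    "GDB", "GENBANK", "OMIM", "PDB", "PIR", "RefSeq", "SWISSPROT",
    "ClinicalTrials.gov", "ISRCTN", "GEO",
    "PubChem-Substance", "PubChem-Compound", "PubChem-BioAssay",
]
SI_CLAUSE = "(" + " OR ".join(f"{name} [si]" for name in SI_REPOSITORIES) + ")"
MSD_TERM = "molecular sequence data [mh:noexp]"
NON_RESEARCH = ["review", "letter", "news", "editorial"]


def build_queries():
    """Return the search strings keyed by table label, in the order applied."""
    q1a = NIH_2011
    q1b = f"{q1a} AND medline [sb]"
    q2 = f"{q1b} AND {SI_CLAUSE}"
    q3 = f"{q1b} NOT {SI_CLAUSE} AND {MSD_TERM}"
    q4 = f"{q1b} NOT {SI_CLAUSE} NOT {MSD_TERM} AND pubmed pmc all[sb]"
    q5 = q4 + "".join(f" NOT {pt} [pt]" for pt in NON_RESEARCH)
    return {"1a": q1a, "1b": q1b, "2": q2, "3": q3, "4": q4, "5": q5}


if __name__ == "__main__":
    for label, query in build_queries().items():
        print(f"Table {label}: {query}")
```

<p>Building each string from the previous one mirrors the way the bolded terms accumulate across the tables, so a change to an early step propagates to every later search.</p>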
<p>From this set of articles, we identified those indicating that the authors had shared their data in a specific repository. This process began by searching for articles that had a Secondary Source Identifier [SI] [
<xref rid="pone.0132735.ref019" ref-type="bibr">19</xref>
]; this identifier indicates when the author of an article has deposited his/her data in one of the specific repositories that are recognized in MEDLINE/PubMed. The repositories that can be designated in the SI field include, but are not limited to: ClinicalTrials.gov, PubChem, Johns Hopkins University Genome Data Bank, Gene Expression Omnibus, GenBank, ISRCTN Register, Mendelian Inheritance in Man, Protein Data Bank, Protein Identification Resource, Reference Sequence, and SWISSPROT Protein Sequence Database. The aforementioned repositories provide evidence of how many articles acknowledge data deposition in each of these locations within a given year (
<xref rid="pone.0132735.t002" ref-type="table">Table 2</xref>
). This step identified 3,528 (
<bold>SI dataset</bold>
) articles that had deposited data into one of the listed repositories (step 2,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<p>We removed the
<bold>SI dataset</bold>
articles from our article set and searched the remaining articles for the Medical Subject Heading (MeSH) “Molecular Sequence Data.” Citations tagged with this MeSH heading are those for which data are likely to be deposited in GenBank or an equivalent repository. For the year 2011 this search identified 3,460 (
<bold>MSD dataset</bold>
) articles that provided indication that data should have been deposited in a repository (
<xref rid="pone.0132735.t003" ref-type="table">Table 3</xref>
; step 3,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<p>We removed the
<bold>MSD dataset</bold>
articles from the sample and searched the remaining articles for those with full-text available in PubMed Central (PMC) using the [sb] search tag. This allowed us to conduct further analysis on information that would be provided only in the full-text of an article (as opposed to the MEDLINE citation) (
<xref rid="pone.0132735.t004" ref-type="table">Table 4</xref>
; step 4,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<p>We reduced the set of articles by identifying and removing all non-research articles, meaning those with the MEDLINE publication type [PT] of review, editorial, news, and letter. This step created a sample that included only full-text research articles with MEDLINE records that did not mention depositing data into a repository (
<xref rid="pone.0132735.t005" ref-type="table">Table 5</xref>
). This process resulted in a sample of 71,910 articles (step 5,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<p>We then examined the articles to identify those that mention the sharing of their data in the acknowledgments section of an article, using the Acknowledgements search field [
<xref rid="pone.0132735.ref020" ref-type="bibr">20</xref>
] of PMC (
<xref rid="pone.0132735.g002" ref-type="fig">Fig 2</xref>
) [
<xref rid="pone.0132735.ref021" ref-type="bibr">21</xref>
]. The Acknowledgments section of a full-text article is often used to indicate when data have been shared in a specific repository. We selected the NIH Data Sharing Repositories Web page [
<xref rid="pone.0132735.ref022" ref-type="bibr">22</xref>
] as our gold standard to gather a list of NIH-specific data repositories, and used keyword variations and acronyms (e.g., Gene Expression Omnibus, GEO, Protein Data Bank, PDB) to search each repository in the Acknowledgments field in PMC with the [ack] search tag for the year 2011. Additionally, the terms “DataCite” and “Dryad” were added to the strategy, seeking occurrences in any PMC search field, because they are well-known resources for discovery of scientific data, including data referenced in scientific journal articles.</p>
<fig id="pone.0132735.g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.g002</object-id>
<label>Fig 2</label>
<caption>
<title>Example of the PubMed Central Acknowledgments where the authors have indicated the deposit of data in a specific repository; PMCID: PMC4085032.</title>
</caption>
<graphic xlink:href="pone.0132735.g002"></graphic>
</fig>
<p>This search identified 814 (
<bold>ACK dataset</bold>
) articles that mentioned one or more of the recognized repositories. After accounting for overlap with the
<bold>SI dataset</bold>
and
<bold>MSD dataset</bold>
, we removed 230 articles in the
<bold>ACK dataset</bold>
from our sample set, leaving us with 71,680 articles that made no mention that their data were deposited in a known repository (step 6,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Of these articles, 198 were not yet available in PMC at the time of our study, so they were removed from the sample, leaving 71,482 articles (step 7,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<p>The final procedure used to identify articles that mention the deposit of data was to scan for the same keyword variations and acronyms from the 45 NIH data repositories within the XML full-text data for the remaining articles [
<xref rid="pone.0132735.ref023" ref-type="bibr">23</xref>
]. This step aimed to fill in any gaps from the two previous strategies and to search beyond the scope of the Acknowledgments field in PMC to find additional mentions of data repositories. It was only possible to perform this search on 10,418 articles for which full-text XML was available via the PMC Open Access Subset [
<xref rid="pone.0132735.ref024" ref-type="bibr">24</xref>
]; the PMC Open Access Subset includes articles that are still protected by copyright, but are made available via a Creative Commons or similar license that provides for more liberal distribution and reuse of the copyrighted work. This method identified 1,825 articles (
<bold>XML dataset</bold>
) in total that mentioned a data repository (step 8,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). We removed these articles from the sample leaving a total number of 69,657 NIH-funded articles that contained “invisible” datasets (
<xref rid="pone.0132735.t006" ref-type="table">Table 6</xref>
).</p>
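<p>A keyword scan of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the program used in the study: the keyword table contains only the example variants named in the text, the demo article is invented, and the function names are hypothetical. It uses whole-word, case-sensitive matching over all text in the element tree.</p>

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical keyword table: repository name -> variants to look for.
# Only the example variants given in the text are listed; the full list
# covering all 45 repositories is not reproduced here.
REPOSITORY_KEYWORDS = {
    "Gene Expression Omnibus": ["Gene Expression Omnibus", "GEO"],
    "Protein Data Bank": ["Protein Data Bank", "PDB"],
    "Dryad": ["Dryad"],
    "DataCite": ["DataCite"],
}


def compile_patterns(keywords):
    """Compile one case-sensitive, whole-word pattern per repository."""
    patterns = {}
    for repository, variants in keywords.items():
        alternatives = "|".join(re.escape(v) for v in variants)
        patterns[repository] = re.compile(r"\b(?:" + alternatives + r")\b")
    return patterns


def repositories_mentioned(root, patterns):
    """Return the set of repositories whose keywords occur in the element text."""
    text = " ".join(root.itertext())
    return {name for name, pattern in patterns.items() if pattern.search(text)}


def demo_article():
    """Build a tiny stand-in for a full-text article tree (invented content)."""
    article = ET.Element("article")
    body = ET.SubElement(article, "body")
    first = ET.SubElement(body, "p")
    first.text = "Structures were deposited in the Protein Data Bank."
    second = ET.SubElement(body, "p")
    second.text = "Expression data are available from GEO."
    return article


if __name__ == "__main__":
    patterns = compile_patterns(REPOSITORY_KEYWORDS)
    print(sorted(repositories_mentioned(demo_article(), patterns)))
```

<p>Returning a set of repository names per article keeps multiple mentions in one article separable. A match confirms only that a term is present, not that data were deposited.</p>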
<table-wrap id="pone.0132735.t006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t006</object-id>
<label>Table 6</label>
<caption>
<title>Breakdown for subtraction of articles that mention the deposit of data.</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t006g" xlink:href="pone.0132735.t006"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Procedure taken</th>
<th align="left" rowspan="1" colspan="1">Articles identified</th>
<th align="left" rowspan="1" colspan="1">Articles remaining</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">1a. NIH-funded articles for 2011 in PubMed</td>
<td align="left" rowspan="1" colspan="1">
<bold>--</bold>
</td>
<td align="left" rowspan="1" colspan="1">119,415</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">1b. NIH-funded articles for 2011 indexed for MEDLINE</td>
<td align="left" rowspan="1" colspan="1">6,326</td>
<td align="left" rowspan="1" colspan="1">113,089</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">2. Articles with repository in [SI] field
<bold>(SI dataset)</bold>
</td>
<td align="left" rowspan="1" colspan="1">3,528</td>
<td align="left" rowspan="1" colspan="1">109,561</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">3. Articles with Molecular Sequence Data MeSH Heading
<bold>(MSD dataset)</bold>
</td>
<td align="left" rowspan="1" colspan="1">3,460</td>
<td align="left" rowspan="1" colspan="1">106,101</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">4. PubMed cited articles not available in PMC</td>
<td align="left" rowspan="1" colspan="1">24,497</td>
<td align="left" rowspan="1" colspan="1">81,604</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">5. Non-research articles</td>
<td align="left" rowspan="1" colspan="1">9,694</td>
<td align="left" rowspan="1" colspan="1">71,910</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">6. Articles with repository in PMC Acknowledgements
<bold>(ACK dataset)</bold>
</td>
<td align="left" rowspan="1" colspan="1">230</td>
<td align="left" rowspan="1" colspan="1">71,680</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">7. Additional articles not available in PMC</td>
<td align="left" rowspan="1" colspan="1">198</td>
<td align="left" rowspan="1" colspan="1">71,482</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">8. Articles with repository in full-text XML (of 10,418 searched)
<bold>(XML dataset)</bold>
</td>
<td align="left" rowspan="1" colspan="1">1,825</td>
<td align="left" rowspan="1" colspan="1">69,657</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Total remaining articles used for subsequent analysis</td>
<td align="left" rowspan="1" colspan="1">
<underline>
<bold>--</bold>
</underline>
</td>
<td align="left" rowspan="1" colspan="1">
<underline>
<bold>69,657</bold>
</underline>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
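<p>The subtraction in Table 6 can be checked with a few lines of code. The sketch below copies the counts from the table and recomputes the running total; the variable and function names are ours.</p>

```python
# Counts copied from Table 6: (step label, articles removed at that step).
START = 119_415  # step 1a: NIH-funded articles for 2011 in PubMed
REMOVED = [
    ("1b. not indexed for MEDLINE", 6_326),
    ("2. SI dataset", 3_528),
    ("3. MSD dataset", 3_460),
    ("4. not available in PMC", 24_497),
    ("5. non-research articles", 9_694),
    ("6. ACK dataset", 230),
    ("7. additional articles not in PMC", 198),
    ("8. XML dataset", 1_825),
]


def remaining_after_each_step(start, removed):
    """Return the running count of articles left after each subtraction."""
    remaining = []
    current = start
    for _label, count in removed:
        current -= count
        remaining.append(current)
    return remaining


if __name__ == "__main__":
    totals = remaining_after_each_step(START, REMOVED)
    for (label, count), left in zip(REMOVED, totals):
        print(f"{label:36s} -{count:>7,d}  remaining {left:>7,d}")
```

<p>Each recomputed total matches the "Articles remaining" column, ending at 69,657.</p>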
<p>The mention of a dataset or repository in the body of the full-text does not necessarily mean that the data are deposited in the repository; it confirms only the presence of the term(s) in the article, not the context. To estimate the frequency with which a mention of a repository corresponds to the actual deposit of a dataset, we extracted a random subsample of 180 articles from the
<bold>XML dataset</bold>
(step 9,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Two reviewers independently examined each article in the subsample to determine whether or not the data had been deposited in a mentioned repository. The reviewers first examined the context surrounding the mention of the data repository and then, if necessary, the full text of the article. If a determination could not be made by either of these two methods, the reviewers checked the named data repositories for evidence that the data had been deposited. Following the independent reviews, the reviewers met to agree on the final determination for each article.</p>
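<p>The subsampling and dual-review procedure can be sketched as follows. This is a hedged illustration only: the text does not say how the random subsample was drawn, so the seed, the placeholder article identifiers, and the function names are all assumptions.</p>

```python
import random


def draw_subsample(article_ids, size, seed):
    """Draw a simple random sample without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(article_ids), size)


def disagreements(review_a, review_b):
    """Return IDs on which two independent reviews differ (to be reconciled)."""
    return sorted(pid for pid in review_a if review_a[pid] != review_b[pid])


if __name__ == "__main__":
    # Placeholder identifiers standing in for the 1,825 XML dataset articles.
    xml_dataset = [f"article-{i:04d}" for i in range(1, 1826)]
    subsample = draw_subsample(xml_dataset, size=180, seed=2011)
    print(len(subsample), len(set(subsample)))
```

<p>Fixing the seed makes the draw repeatable, and listing disagreements gives the reviewers a worklist for the reconciliation meeting.</p>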
</sec>
<sec id="sec008">
<title>Analysis of articles with “invisible” datasets</title>
<p>To analyze the 69,657 articles containing “invisible” datasets, we extracted a random sample of 385 articles (95% confidence level) for further analysis (step 10,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Thirty members of NIH staff were recruited to annotate and analyze the datasets reported in the 385 articles. Annotators were subject experts working in a variety of disciplines, including MEDLINE indexers of biomedical literature, biomedical informaticians, physicians, neuroscientists, molecular biologists, librarians, and organizational directors. Each annotator was assigned 25 articles through randomization, and each set of articles was assigned to two annotators, for a total of 16 sets, to provide a means of measuring the reliability of the counts. One set of annotators analyzed only 10 articles, the balance remaining after articles had been assigned to the other annotators. Each annotator was asked to review his or her assigned articles in their entirety and answer questions related to each dataset described therein. Annotators were instructed to look closely at the methodology section of the paper and any figures or tables to determine how many different measurements were taken. Annotators received a guideline document that included a set of questions to be answered for each assigned article, along with a list of controlled terms for the anticipated answers to some of the questions. The guidelines and controlled terms went through several iterations, including a pilot study and several internal reviews, to improve the clarity of the questions and enhance the comparability of results between annotators. The questions are listed in
<xref rid="pone.0132735.g003" ref-type="fig">Fig 3</xref>
.</p>
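<p>The sample size of 385 is what the standard large-population formula gives for a 95% confidence level, assuming a 5% margin of error and maximum variability (p = 0.5). The text does not state the margin or the proportion, so both are assumptions in this sketch.</p>

```python
import math
from statistics import NormalDist


def sample_size(confidence, margin, proportion=0.5):
    """Cochran's formula for a large population: n = z^2 * p * (1 - p) / e^2."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


if __name__ == "__main__":
    # Assumed inputs: the 5% margin of error and p = 0.5 are not in the text.
    print(sample_size(confidence=0.95, margin=0.05))
    # Consistency check on the assignment: 15 sets of 25 plus one set of 10.
    print(15 * 25 + 10)
```

<p>Under these assumptions the formula yields 385. The same total follows from 15 sets of 25 articles plus one set of 10, which is consistent with the 16 sets described above.</p>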
<fig id="pone.0132735.g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.g003</object-id>
<label>Fig 3</label>
<caption>
<title>Questions for annotating datasets contained in research articles.</title>
</caption>
<graphic xlink:href="pone.0132735.g003"></graphic>
</fig>
<p>The categories used to describe the type of data collected in each dataset were developed by the authors of this paper, based on their knowledge of various types of data collected in biomedical research. They do not reflect any particular standard for classifying dataset types. These data types were also informed by an earlier pilot study, and consultation with a variety of stakeholders within the NLM including leadership, indexers, bioinformaticians, and ontologists.</p>
<p>Annotators were asked to populate a spreadsheet with their answers and create a row in the spreadsheet for each dataset found within an article. This procedure provided an opportunity to count how many datasets were created per article, and understand the different types of data that were collected per article. Once annotators completed their 25 articles, the results were returned for review and analysis.</p>
</sec>
</sec>
<sec sec-type="results" id="sec009">
<title>Results</title>
<p>We first summarize the results of our analysis of datasets in known repositories and then present the results of our analysis of the invisible datasets (
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
).</p>
<sec id="sec010">
<title>Datasets in known repositories</title>
<sec id="sec011">
<title>SI Dataset</title>
<p>The use of the SI field identified journal articles for which a dataset had been deposited in a specified repository (step 2,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). It provided valuable information about the common locations from which data are frequently shared. From the original sample of 113,089 MEDLINE citations, more than 3,500 (3.1%) listed data repositories in the
<bold>SI dataset</bold>
. The most common repositories where data were deposited were ClinicalTrials.gov, Protein Data Bank, Gene Expression Omnibus, and GenBank (
<xref rid="pone.0132735.g004" ref-type="fig">Fig 4</xref>
).</p>
<fig id="pone.0132735.g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.g004</object-id>
<label>Fig 4</label>
<caption>
<title>Repositories identified from the PubMed SI field and PMC Acknowledgements where datasets were deposited.</title>
</caption>
<graphic xlink:href="pone.0132735.g004"></graphic>
</fig>
</sec>
<sec id="sec012">
<title>ACK Dataset</title>
<p>Review of the PMC Acknowledgements field yielded results similar to the SI field search (step 6,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). The
<bold>ACK dataset</bold>
(n = 814) articles acknowledged or mentioned a recognized data repository in more than 3,200 instances, for an average of almost 4 datasets per paper. Protein Data Bank, ClinicalTrials.gov, GenBank, and GEO were again the most common repositories where data were being shared (
<xref rid="pone.0132735.g004" ref-type="fig">Fig 4</xref>
). Because the Acknowledgments search identified a wider range of data repositories than were captured in the SI field in 2011, we were able to gain a better understanding of how often other repositories are used. For example, the Influenza Research Database (IRD), Mouse Genome Informatics (MGI) repository, Database of Interacting Proteins (DIP), and Flybase were the most heavily used data repositories beyond the Protein Data Bank (PDB) and databases managed by NLM. This finding provides insight into the frequency of use of these repositories in a given year (
<xref rid="pone.0132735.g004" ref-type="fig">Fig 4</xref>
).</p>
</sec>
<sec id="sec013">
<title>XML Dataset</title>
<p>The final step to identify journal articles that mention a data repository, the XML method, identified 1,825 additional articles that mentioned a data repository somewhere in the text other than the Acknowledgments section (step 8,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). As noted in the methodology, this analysis was performed on only 10,418 publications from the PMC Open Access Subset, meaning that 17.5% of the articles analyzed were found to mention a data repository. This finding strongly suggests that any future reviews should be expanded beyond the Acknowledgements section to the entire text of an article. The repositories mentioned in the full-text XML aligned with those identified in the
<bold>SI</bold>
and
<bold>ACK datasets</bold>
. GenBank, Protein Data Bank, and Gene Expression Omnibus were again the most prominent data repositories mentioned (
<xref rid="pone.0132735.g005" ref-type="fig">Fig 5</xref>
). This method also identified the long tail of deposits in a number of other specialized repositories. One limitation of this approach is that when multiple data repositories were mentioned in the same article, the XML program could not count the separate repositories individually. This caused all articles that mentioned more than one repository to be categorized together. In total, 29% of all the articles mentioned more than one repository.</p>
<fig id="pone.0132735.g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.g005</object-id>
<label>Fig 5</label>
<caption>
<title>Keywords identified from full-text XML data mining.</title>
</caption>
<graphic xlink:href="pone.0132735.g005"></graphic>
</fig>
<p>In estimating how often the mention of a repository in the XML meant that a dataset had actually been deposited in the mentioned repository, the reviewers determined that 33% of the 180 article subsample taken from the
<bold>XML dataset</bold>
reflected an actual deposit of data into a data repository (step 9,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). The remainder fell into four main categories: those that used data from a repository (47%); those that mentioned a repository as background information based on previous research (6%); those that discussed a repository as the subject of the article (4%); or those that used an ambiguous data repository acronym (10%) (e.g., the acronym RGD for Rat Genome Database is also used to describe arginyl-glycyl-aspartic acid). Extrapolating these findings to our larger set, we estimate that 598 articles in the
<bold>XML dataset</bold>
(33% of 1,825 articles) would refer to the deposit of data into a named repository. This figure is equal to 5.7% of all the articles examined from the PMC Open Access Subset.</p>
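<p>The arithmetic behind this extrapolation can be reproduced from the reported figures. The sketch below uses only numbers stated in the text; the names are ours.</p>

```python
# Figures reported in the text.
XML_DATASET = 1_825          # articles mentioning a repository
ARTICLES_SEARCHED = 10_418   # PMC Open Access Subset articles scanned
ESTIMATED_DEPOSITS = 598     # reported estimate of articles with a true deposit

# Reviewer categories for the 180-article subsample, in percent.
CATEGORIES = {
    "deposited data in a repository": 33,
    "used data from a repository": 47,
    "repository as background": 6,
    "repository as subject of article": 4,
    "ambiguous acronym": 10,
}


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(sum(CATEGORIES.values()))
    print(percent(XML_DATASET, ARTICLES_SEARCHED))
    print(percent(ESTIMATED_DEPOSITS, XML_DATASET))
    print(percent(ESTIMATED_DEPOSITS, ARTICLES_SEARCHED))
```

<p>The five categories account for the whole subsample. The reported 598 corresponds to 32.8% of 1,825, which rounds to the 33% quoted, and to 5.7% of the 10,418 articles searched.</p>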
<p>The findings from these various analyses provide a rough estimate of the fraction of NIH-funded research articles that indicate that data were deposited in a known public repository (
<xref rid="pone.0132735.t007" ref-type="table">Table 7</xref>
). From the larger sample of MEDLINE citations that referenced NIH support in 2011, we found that 3.1% had an SI field (
<bold>SI dataset</bold>
) that indicated deposit in a known repository (step 2,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). From the remaining set of articles, another 3.2% had molecular data (
<bold>MSD dataset</bold>
) that would likely be deposited in GenBank or an equivalent repository (step 3,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Of the research articles for which full-text was available, 0.3% included a unique acknowledgement of a data repository (
<bold>ACK dataset</bold>
) that was discoverable by searching using the [ack] search tag in PMC (step 6,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Of articles examined from the PMC Open Access Subset (
<bold>XML dataset</bold>),
an estimated 5.7% reference the deposit of data into a named repository (step 9,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). Because the percentages are taken from different subsamples of the larger population of NIH-funded articles in 2011, they cannot be strictly summed. As a rough measure, however, they suggest that an estimated 12.3% (
<xref rid="pone.0132735.t007" ref-type="table">Table 7</xref>
) of the articles published by NIH-funded investigators in 2011 referred to datasets that are, or may be, stored in a known publicly accessible data repository. The datasets in the remaining 88% of published journal articles were “invisible.”</p>
<table-wrap id="pone.0132735.t007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t007</object-id>
<label>Table 7</label>
<caption>
<title>Estimated Number of Articles with a Dataset Stored in a Known Repository.</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t007g" xlink:href="pone.0132735.t007"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Procedure taken</th>
<th align="left" rowspan="1" colspan="1">Articles Examined</th>
<th align="left" rowspan="1" colspan="1">Articles identified</th>
<th align="left" rowspan="1" colspan="1">% of Examined Articles</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">[SI] field (
<bold>SI dataset</bold>
)</td>
<td align="left" rowspan="1" colspan="1">113,089</td>
<td align="left" rowspan="1" colspan="1">3,528</td>
<td align="char" char="." rowspan="1" colspan="1">3.1%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Molecular Sequence Data MeSH Heading (
<bold>MSD dataset</bold>
)</td>
<td align="left" rowspan="1" colspan="1">109,561</td>
<td align="left" rowspan="1" colspan="1">3,460</td>
<td align="char" char="." rowspan="1" colspan="1">3.2%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">PMC Acknowledgements (
<bold>ACK dataset</bold>
)</td>
<td align="left" rowspan="1" colspan="1">71,910</td>
<td align="left" rowspan="1" colspan="1">230</td>
<td align="char" char="." rowspan="1" colspan="1">0.3%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Full-text XML (
<bold>XML dataset</bold>
)</td>
<td align="left" rowspan="1" colspan="1">10,418</td>
<td align="left" rowspan="1" colspan="1">598</td>
<td align="char" char="." rowspan="1" colspan="1">5.7%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">
<bold>Estimated Total</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold></bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold></bold>
</td>
<td align="char" char="." rowspan="1" colspan="1">
<bold>12.3%</bold>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
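<p>The percentages in Table 7 follow directly from its first two numeric columns. The sketch below recomputes them from the counts printed in the table. As noted in the text, the rows come from different subsamples, so the sum is only a rough measure.</p>

```python
# Rows copied from Table 7: (procedure, articles examined, articles identified).
TABLE_7 = [
    ("SI dataset", 113_089, 3_528),
    ("MSD dataset", 109_561, 3_460),
    ("ACK dataset", 71_910, 230),
    ("XML dataset", 10_418, 598),
]


def row_percentages(rows):
    """Return each row's identified/examined share, rounded to one decimal."""
    return [round(100 * found / examined, 1) for _name, examined, found in rows]


if __name__ == "__main__":
    shares = row_percentages(TABLE_7)
    print(shares)
    print(round(sum(shares), 1))
```

<p>The recomputed shares are 3.1%, 3.2%, 0.3%, and 5.7%, which add up to the 12.3% estimated total.</p>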
</sec>
</sec>
<sec id="sec014">
<title>“Invisible” datasets</title>
<p>Our analysis of the 385 journal articles without a discoverable reference to a data repository is summarized in
<xref rid="pone.0132735.t008" ref-type="table">Table 8</xref>
and highlights the challenges in defining and counting datasets. Eight of the 30 annotators defined a dataset as consisting of all of the data resulting from an article, irrespective of the different types of data involved (e.g., chemical test results, imaging data). This meant that half of the sets of articles had only one review that was consistent with our proposed methodology, rather than the two desired. As a result, each set of articles that included only one valid review was removed from the final analysis, leaving 8 sets of articles for analysis (step 11,
<xref rid="pone.0132735.g001" ref-type="fig">Fig 1</xref>
). This reduced sample of 200 articles yielded a confidence level of 84.3% for our subsequent estimates.</p>
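<p>The 84.3% figure can be reproduced by asking what two-sided confidence level a sample of 200 supports at a fixed margin of error. The sketch below assumes a 5% margin and p = 0.5, neither of which is stated in the text, and uses the normal approximation for a large population.</p>

```python
import math
from statistics import NormalDist


def confidence_level(n, margin=0.05, proportion=0.5):
    """Two-sided confidence level achieved by a sample of size n."""
    standard_error = math.sqrt(proportion * (1 - proportion) / n)
    z = margin / standard_error
    return 2 * NormalDist().cdf(z) - 1


if __name__ == "__main__":
    # Assumed inputs: margin of error 5% and p = 0.5 (not given in the text).
    for n in (385, 200):
        print(n, round(100 * confidence_level(n), 1))
```

<p>Under these assumptions, 385 articles correspond to about 95.0% and 200 articles to about 84.3%, in line with the values reported for the full and reduced samples.</p>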
<table-wrap id="pone.0132735.t008" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0132735.t008</object-id>
<label>Table 8</label>
<caption>
<title>Summary of Analysis of Invisible Datasets.</title>
</caption>
<alternatives>
<graphic id="pone.0132735.t008g" xlink:href="pone.0132735.t008"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" valign="middle" span="1"></col>
<col align="left" valign="middle" span="1"></col>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="1" colspan="1">Measure</th>
<th align="left" rowspan="1" colspan="1">Finding</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" align="left" colspan="1">Average number of datasets per article</td>
<td align="left" rowspan="1" colspan="1">All articles reviewed: 3.4 per article</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Articles with two reviews: 2.9 per article</td>
</tr>
<tr>
<td rowspan="2" align="left" colspan="1">Type of subject</td>
<td align="left" rowspan="1" colspan="1">Human subjects: 28.3%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Non-human animal subjects: 26.1%</td>
</tr>
<tr>
<td rowspan="2" align="left" colspan="1">New vs. existing data</td>
<td align="left" rowspan="1" colspan="1">New datasets: 87%</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Existing datasets: 13%</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>For these sets of articles with two reviews, there were substantial differences between annotators in the number of datasets identified and described. Within a set of articles, the average difference between the annotators who identified high and low numbers of datasets was 43%, a large percentage difference for the purposes of validating this exercise. Although the percentage differences are large, the absolute numbers are small, and most pairs differed by only one or two datasets. Only one set of annotators counted widely divergent numbers of datasets in their sample.</p>
<p>The analysis nevertheless provides insight into the number of datasets per article. Considering only the eight sets of articles for which there were two valid reviews, the average number of datasets counted per article was 2.9 (
<xref rid="pone.0132735.t008" ref-type="table">Table 8</xref>
). When all sets of annotations were evaluated (including those sets where annotators counted only one dataset per article), the average number increased to 3.4 datasets per article, reflecting the fact that some of the publications in the additional sets of articles were reported to contain high numbers of datasets. An average of between 2.9 and 3.4 datasets per article aligns with the estimated four datasets per paper we found in our methods for identifying datasets within the
<bold>ACK dataset</bold>
.</p>
<p>There was greater consistency between pairs of annotators in identifying data from human subjects and live non-human animals. The average percentage of articles identified as reporting research involving human subjects was 28.3%, and the percentage identified as involving non-human animals was 26.1% (
<xref rid="pone.0132735.t008" ref-type="table">Table 8</xref>
).</p>
<p>The last phase of analysis determined how much of the data that was used in the course of NIH-funded research published in 2011 was new data, and how much was pre-existing data. We counted the percentage of new versus pre-existing data for each set of articles and then calculated the total percentage from all. Annotators were very consistent in making this determination. Combining results from all annotators, we estimate that 87% of the articles involved the collection of new data, and 13% involved the analysis of pre-existing data (
<xref rid="pone.0132735.t008" ref-type="table">Table 8</xref>
). While new data were collected for purposes of the research reported in the article, pre-existing data included data from previously conducted clinical trials (e.g., reanalysis of the clinical trial data) or surveys (e.g., at local, regional, or national level), among other sources. Some articles made use of both new and pre-existing data.</p>
<p>Annotators were inconsistent in their ability to assign data types from our controlled list of categories to datasets found within articles. Few annotators chose to use the controlled list; most preferred to use the “Other” option to describe the datasets they found, highlighting the difficulty in establishing a suitable classification for biomedical data types (
<xref rid="pone.0132735.g003" ref-type="fig">Fig 3</xref>
).</p>
</sec>
</sec>
<sec sec-type="conclusions" id="sec015">
<title>Discussion</title>
<p>This study is an initial step toward estimating the additional resources and infrastructure that will be needed to support expanding mandates to make data resulting from NIH-funded and other research available for use by researchers and the public. Methods for discoverability, access, and citation will need to be able to scale in a cost-effective manner if they are to include all such datasets used in published NIH-funded research studies, let alone all datasets used in published biomedical research regardless of funder. It is likely that an increasing number of biomedical datasets will be deposited in general-purpose repositories (e.g., Dryad, Figshare), which will affect strategies for enhancing discoverability and access. Data citations are still uncommon in the scientific community. One study found that 88.1% of datasets within the Data Citation Index remained uncited [
<xref rid="pone.0132735.ref025" ref-type="bibr">25</xref>
]. Another study found that even a national data center could rarely identify formal citations of its data [
<xref rid="pone.0132735.ref026" ref-type="bibr">26</xref>
]. Without the ability to link and connect research datasets across multiple platforms, discovery and access will remain an issue.</p>
<p>Our results suggest that datasets referenced in only about 12% of articles reporting NIH-funded research in 2011 were (or were eligible to be) deposited in a known, publicly accessible data repository. Our estimates echo findings from previous studies that indicate a large portion of datasets are not shared. In one study, only 9% of articles from high-impact journals deposited their full dataset (including raw data) online [
<xref rid="pone.0132735.ref012" ref-type="bibr">12</xref>
]. Another study found that not a single paper in its sample cited a dataset with a unique identifier, and therefore gave no indication that the data were shared anywhere [
<xref rid="pone.0132735.ref013" ref-type="bibr">13</xref>
]. Our analysis also found that an expected 858 articles (47% of 1,825 articles) in the
<bold>XML dataset</bold>
mention the use of data from a known repository (rather than a deposit into a repository). This figure is equal to 8.2% of the articles we examined from the PMC Open Access Subset and provides a measure of the level of reuse of data from known repositories.</p>
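<p>As a rough check on the reuse figures above, the arithmetic can be sketched in a few lines of Python. The counts are those quoted in the text (1,825 articles, 47%, and the 10,418 articles in the XML dataset); the variable names are illustrative assumptions and not part of the study's tooling.</p>
<preformat>
```python
# Figures quoted in the text; the variable names are illustrative assumptions.
SAMPLED_ARTICLES = 1825        # base count quoted alongside the 47% figure
REUSE_FRACTION = 0.47          # share mentioning use of data from a known repository
XML_DATASET_ARTICLES = 10418   # full-text XML articles examined

# Expected number of articles that mention reuse of repository data.
expected_reuse_articles = round(SAMPLED_ARTICLES * REUSE_FRACTION)

# The same count expressed as a percentage of the whole XML dataset.
reuse_share_percent = round(100 * expected_reuse_articles / XML_DATASET_ARTICLES, 1)

print(expected_reuse_articles)
print(reuse_share_percent)
```
</preformat>
<p>Under these assumptions the sketch reproduces the 858 articles and the 8.2% share reported above.</p>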
<p>Our findings also help characterize the invisible datasets from biomedical research that are not deposited in a known repository. We found 69,657 articles that were published in 2011 and reported on NIH-funded research but did not indicate that data had been deposited in a known repository. We estimate that the research described in these articles used an average of 2.9 to 3.4 datasets per article. These figures mean that approximately 200,000 to 235,000 datasets were used in NIH-funded research published in 2011 but not deposited in one of the well-known public repositories for specific categories of biomedical data (e.g., GEO, GenBank, Protein DataBank, ClinicalTrials.gov).</p>
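<p>The scaling behind this estimate can be illustrated with a minimal sketch. It assumes simple multiplication of the article count by each per-article average; the function and constant names are hypothetical and chosen only for illustration.</p>
<preformat>
```python
# Figures quoted in the text; names are illustrative assumptions.
ARTICLES_WITHOUT_KNOWN_DEPOSIT = 69657
DATASETS_PER_ARTICLE_LOW = 2.9    # articles with two valid reviews
DATASETS_PER_ARTICLE_HIGH = 3.4   # all articles reviewed


def estimate_datasets(article_count, datasets_per_article):
    """Scale a per-article average up to a corpus-wide dataset count."""
    return round(article_count * datasets_per_article)


low_estimate = estimate_datasets(ARTICLES_WITHOUT_KNOWN_DEPOSIT, DATASETS_PER_ARTICLE_LOW)
high_estimate = estimate_datasets(ARTICLES_WITHOUT_KNOWN_DEPOSIT, DATASETS_PER_ARTICLE_HIGH)

print(low_estimate, high_estimate)
```
</preformat>
<p>The two products, roughly 202,000 and 237,000, are consistent with the rounded range of approximately 200,000 to 235,000 datasets.</p>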
<p>Given that as many as 88% of biomedical research datasets may not currently be deposited in a well-known, public data repository, the problem of improving the discoverability of biomedical datasets remains significant. The basic challenge is therefore deciding which data are most worthy of the additional resources and effort that will inevitably be required to make them readily discoverable and accessible. Arguably, useful and manageable data discovery systems should focus on datasets that show potential for reuse or that support significant findings, whose underlying data should be available so that others can assess validity and accuracy. A strong case can be made that datasets derived from live subjects have a particularly high priority for discoverability and accessibility. Making such datasets readily available could reduce the need to expose additional live subjects to potential risks and, in the case of human subjects, help to meet the ethical obligation to ensure that their participation in research studies adds to scientific knowledge.</p>
<p>Another point to consider is
<italic>how</italic>
data can be best discovered, accessed, and understood. Our review of datasets suggests that while some datasets (such as those stored in known repositories like ClinicalTrials.gov, GenBank, and Protein DataBank) can stand on their own and serve as resources for other investigators, many datasets may have limited utility outside the study for which they were collected. These datasets may be meaningful only when considered alongside other datasets collected for the same study and in conjunction with the journal article that summarizes them. This observation has significant implications for data discovery and storage, because it suggests that in some cases the preferred discovery tool may be the publication where the datasets are described rather than a separate mechanism that would find and retrieve them individually and independently. This argument has already been addressed within the scientific community, with some calling for an advanced publication where the underlying data can be extracted directly from the paper [
<xref rid="pone.0132735.ref027" ref-type="bibr">27</xref>
–
<xref rid="pone.0132735.ref030" ref-type="bibr">30</xref>
]. Nanopublications are another development that sheds light on how to provide context for datasets drawn from a scientific paper; these abridged data publications provide narrative descriptions of data pulled directly from an article [
<xref rid="pone.0132735.ref031" ref-type="bibr">31</xref>
]. We believe that including dataset metadata summaries within the published article may be an efficient way to promote the discovery of these datasets.</p>
<p>Further work is also needed to determine how to define a dataset. As evidenced by the lack of consistency between annotators with respect to the number of datasets they identified, there are differences in perceptions of what constitutes a dataset and in how well data collected or used in a research study are described in journal articles. Depending on one’s perspective, a single dataset could be: all of the data that is collected or used in a study; all data collected at a specific time within a study; pre- or post-intervention; a discrete type of data from a specific diagnostic device; or even every individual measurement reported in a research article. Data access and sharing requirements must clearly define a dataset to outline expectations of what researchers will be required to share and submit and what will be available to potential users. Requirements are likely to vary depending on the use (or reuse) cases for different types of data.</p>
<p>Data creation and analysis pipelines raise additional questions about how data should be described and at what point along the pipeline. Collected data go through multiple processing transformations during analysis. As a simple example, an image may be collected of a cell (e.g., an optical image); that image may be analyzed by measuring the size of certain structures in the cell (e.g., numerical data); the numerical/structural data from multiple cells may be aggregated for analysis (e.g., to compare the size of structures in treated versus untreated cells). Results of that analysis may be shown in a table or a graph, perhaps showing trends in size of the structures in the treated and untreated cells over time. For a researcher interested in reproducing the research, the basic imaging data may be of most interest, but such a researcher might also want or need to know how the data were reduced and need access to associated data processing algorithms. For a researcher interested in comparing results across studies, the more processed data may be of most interest. For a researcher interested in reusing the data, the data from a particular point along the data processing pipeline might be most useful. Providing data at each step along the pipeline might prove to be onerous or overly complex for data generators and those who want to make use of the data.</p>
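<p>The cell-imaging example above can be sketched as a three-stage pipeline in which each stage's output is kept as a separate candidate dataset. This is only an illustration: the measurement values, stage names, and helper functions are hypothetical, and each "image" is represented simply by the structure sizes that a measurement step would extract from it.</p>
<preformat>
```python
from statistics import mean

# Stage 1: raw data. Each hypothetical "image" is stood in for by the list of
# structure sizes (arbitrary units) that would be measured from it.
raw_images = {
    "treated": [[1.0, 2.0], [3.0]],
    "untreated": [[2.0, 4.0], [6.0]],
}


def measure(images):
    """Stage 2: flatten per-image measurements into one list per group."""
    return [size for image in images for size in image]


def aggregate(measurements_by_group):
    """Stage 3: summarise each group by its mean structure size."""
    return {group: mean(sizes) for group, sizes in measurements_by_group.items()}


measurements = {group: measure(images) for group, images in raw_images.items()}
summary = aggregate(measurements)

# Each stage could be described and shared as a dataset in its own right.
pipeline = {"raw": raw_images, "measured": measurements, "aggregated": summary}
print(summary)
```
</preformat>
<p>In this sketch a researcher reproducing the work would start from the "raw" stage, one comparing results across studies might need only the "aggregated" stage, and one reusing the data could pick any stage in between.</p>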
<p>Any system for data discovery and access must describe data in a way that will be useful for those researchers, health professionals or members of the public who are interested in reviewing biomedical data. An examination of a variety of metadata schemas [
<xref rid="pone.0132735.ref032" ref-type="bibr">32</xref>
–
<xref rid="pone.0132735.ref034" ref-type="bibr">34</xref>
] and the metadata employed in existing NIH data repositories indicates that the baseline description of datasets in current repositories does not differ greatly from descriptive metadata for journal articles or archival objects. However, we are not aware of strong evidence that the current metadata schemes applied to biomedical datasets either do or do not meet the needs of researchers seeking data to reuse. There has been considerable discussion about using data publications to provide enriched, detailed metadata and descriptions of individual datasets, thereby making the data more discoverable [
<xref rid="pone.0132735.ref035" ref-type="bibr">35</xref>
,
<xref rid="pone.0132735.ref036" ref-type="bibr">36</xref>
]. Some research has begun to examine the quality of metadata used in scientific data repositories [
<xref rid="pone.0132735.ref037" ref-type="bibr">37</xref>
], but more research is needed to determine what metadata would enable efficient discovery of various types of data. Analysis of current use patterns of existing repositories that accommodate disparate datasets may shed light on what types of data and descriptive metadata are most useful.</p>
<p>Determining which types of biomedical data have the highest reuse value, how to describe them usefully and cost-effectively, and how to make them accessible in a sustainable way are key challenges for the NIH and its recently established Data Discovery Index Coordination Consortium [
<xref rid="pone.0132735.ref038" ref-type="bibr">38</xref>
] as they move forward to make biomedical big data more discoverable, accessible, and citable.</p>
</sec>
<sec sec-type="conclusions" id="sec016">
<title>Conclusion</title>
<p>These findings represent a first look into the landscape of NIH-funded data. An understanding of the varying types of data that are created throughout the course of biomedical research and the knowledge that a substantial amount of new data is created per article in a given year will help to inform efforts to improve the discoverability and accessibility of digital biomedical research data. Differences in perspective encountered among participants in the study suggest that the creation of data discovery tools for biomedical research data will not be straightforward. Decisions will have to be made as to what data will be selected for description and careful consideration will need to be given to identifying how to describe datasets derived from NIH-funded research.</p>
</sec>
</body>
<back>
<ack>
<p>Investigators (institution and location) in the NIH Big Data Annotator Group include (in alphabetical order): Swapna Abhyankar (National Library of Medicine, Bethesda, MD), Olubumi Akiwumi (Oregon Health &amp; Science University, Portland, OR), Olivier Bodenreider (National Library of Medicine, Bethesda, MD), Sally Davidson (National Library of Medicine, Bethesda, MD), Dina Demner Fushman (National Library of Medicine, Bethesda, MD), Tracy Edinger (Kaiser Permanente, Portland, OR), Greg Farber (National Institute of Mental Health, Bethesda, MD), Karen Gutzman (Bernard Becker Medical Library, Chicago, IL), Mary Ann Hantakas (National Library of Medicine, Bethesda, MD), Preeti Kochar (National Library of Medicine, Bethesda, MD), Jennie Larkin (National Heart Lung and Blood Institute, Bethesda, MD), Peter Lyster (National Institute of General Medical Sciences, Bethesda, MD), Matt McAuliffe (Federal Interagency Traumatic Brain Injury Research Informatics System, Bethesda, MD), Shari Mohary (National Library of Medicine, Bethesda, MD), Helen Ochej (National Library of Medicine, Bethesda, MD), Olga Printseva (National Library of Medicine, Bethesda, MD), Oleg Rodionov (National Library of Medicine, Bethesda, MD), Laritza Rodriguez (National Library of Medicine, Bethesda, MD), Suzy Roy (National Library of Medicine, Bethesda, MD), Susan Schmidt (National Library of Medicine, Bethesda, MD), Sonya Shooshan (National Library of Medicine, Bethesda, MD), Matthew Simpson (National Library of Medicine, Bethesda, MD), Corinn Sinnot (National Library of Medicine, Bethesda, MD), Samantha Tate (National Library of Medicine, Bethesda, MD), Janice Ward (National Library of Medicine, Bethesda, MD), Melissa Yorks (National Library of Medicine, Bethesda, MD).</p>
<p>We gratefully acknowledge Lori Klein, National Library of Medicine, for her assistance with the preparation of the References list.</p>
<p>This research was supported by the Intramural Research Program of the U.S. National Institutes of Health, National Library of Medicine (NLM) and in part by an appointment to the NLM Associate Fellowship Program sponsored by the National Library of Medicine and administered by the Oak Ridge Institute for Science and Education.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0132735.ref001">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chan</surname>
<given-names>AW</given-names>
</name>
,
<name>
<surname>Song</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Vickers</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Jefferson</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Dickersin</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Gøtzsche</surname>
<given-names>PC</given-names>
</name>
,
<etal>et al</etal>
<article-title>Increasing value and reducing waste: addressing inaccessible research</article-title>
.
<source>Lancet</source>
.
<year>2014</year>
<month>1</month>
<day>18</day>
;
<volume>383</volume>
(
<issue>9913</issue>
):
<fpage>257</fpage>
<lpage>66</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1016/S0140-6736(13)62296-5">10.1016/S0140-6736(13)62296-5</ext-link>
</comment>
Epub 2014 Jan 8. Accessed 20 Feb 2015.
<pub-id pub-id-type="pmid">24411650</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref002">
<label>2</label>
<mixed-citation publication-type="journal">
<name>
<surname>Névéol</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Wilbur</surname>
<given-names>WJ</given-names>
</name>
,
<name>
<surname>Lu</surname>
<given-names>Z</given-names>
</name>
.
<article-title>Extraction of data deposition statements from the literature: a method for automatically tracking research results</article-title>
.
<source>Bioinformatics</source>
.
<year>2011</year>
<month>12</month>
<day>1</day>
;
<volume>27</volume>
(
<issue>23</issue>
):
<fpage>3306</fpage>
<lpage>12</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1093/bioinformatics/btr573">10.1093/bioinformatics/btr573</ext-link>
</comment>
Epub 2011 Oct 13; PubMed Central PMCID: PMC3223368. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223368/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3223368/</ext-link>
. Accessed 9 June 2015.
<pub-id pub-id-type="pmid">21998156</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref003">
<label>3</label>
<mixed-citation publication-type="other">OECD. [Paris]: Organization of Economic Co-operation and Development. Open science; [accessed 2015 Jun 10]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.oecd.org/sti/outlook/e-outlook/stipolicyprofiles/interactionsforinnovation/openscience.htm">http://www.oecd.org/sti/outlook/e-outlook/stipolicyprofiles/interactionsforinnovation/openscience.htm</ext-link>
. Accessed 10 Jun 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref004">
<label>4</label>
<mixed-citation publication-type="other">EU Framework Programme for Research and Innovation. Guidelines on open access to scientific publications and research data in Horizon 2020. Version 16. [place unknown]: European Commission; 2013 Dec. 14 p. Available:
<ext-link ext-link-type="uri" xlink:href="http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf">http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf</ext-link>
. Accessed 4 Mar 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref005">
<label>5</label>
<mixed-citation publication-type="other">European Research Council, Scientific Council. Open access guidelines for research results funded by the ERC. [place unknown]: European Research Council; revised 2014 Dec. 3 p. Available:
<ext-link ext-link-type="uri" xlink:href="http://erc.europa.eu/sites/default/files/document/file/ERC_Open_Access_Guidelines-revised_2014.pdf">http://erc.europa.eu/sites/default/files/document/file/ERC_Open_Access_Guidelines-revised_2014.pdf</ext-link>
. Accessed 15 Mar 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref006">
<label>6</label>
<mixed-citation publication-type="other">Tri-Agency open access policy on publications. [Ottawa (ON)]: Government of Canada, Public Works and Government Services Canada Publishing and Depository Services; 2015 [modified 2015 Feb 27; accessed 2015 Mar 12]. [about 3 p.]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.science.gc.ca/default.asp?lang=En&amp;n=F6765465-1">http://www.science.gc.ca/default.asp?lang=En&amp;n=F6765465-1</ext-link>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref007">
<label>7</label>
<mixed-citation publication-type="other">Holdren JP (Director, Office of Science and Technology Policy, Executive Office of the President, Washington, DC). Increasing access to the results of federally funded scientific research. Memorandum to: Heads of Executive Departments and Agencies. 2013 Feb 22. 6 p. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf">http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf</ext-link>
. Accessed 1 Mar 2014.</mixed-citation>
</ref>
<ref id="pone.0132735.ref008">
<label>8</label>
<mixed-citation publication-type="other">National Institutes of Health plan for increasing access to scientific publications and digital scientific data from NIH funded scientific research. [Bethesda (MD)]: U.S. Department of Health and Human Services, National Institutes of Health; 2015 Feb. 44 p. Available:
<ext-link ext-link-type="uri" xlink:href="http://grants.nih.gov/grants/NIH-Public-Access-Plan.pdf">http://grants.nih.gov/grants/NIH-Public-Access-Plan.pdf</ext-link>
. Accessed 12 Feb 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref009">
<label>9</label>
<mixed-citation publication-type="other">National Institutes of Health (US). Bethesda (MD): U.S. Department of Health and Human Services, National Institutes of Health (US); NIH budget; [reviewed 2015 Jan 29; accessed 2015 Mar 19]; [about 3 screens]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.nih.gov/about/budget.htm">http://www.nih.gov/about/budget.htm</ext-link>
. Accessed 19 Mar 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref010">
<label>10</label>
<mixed-citation publication-type="other">Big data to knowledge (BD2K). Bethesda (MD): U.S. Department of Health and Human Services, National Institutes of Health (US); 2012 [last updated 2015 Jun 1; accessed 2015 Jun 9]. Available:
<ext-link ext-link-type="uri" xlink:href="https://datascience.nih.gov/bd2k">https://datascience.nih.gov/bd2k</ext-link>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref011">
<label>11</label>
<mixed-citation publication-type="journal">
<name>
<surname>Margolis</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Derr</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Dunn</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Huerta</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Larkin</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Sheehan</surname>
<given-names>J</given-names>
</name>
,
<etal>et al</etal>
<article-title>The National Institutes of Health's Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data</article-title>
.
<source>J Am Med Inform Assoc</source>
.
<year>2014</year>
<season>Nov-Dec</season>
;
<volume>21</volume>
(
<issue>6</issue>
):
<fpage>957</fpage>
<lpage>8</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1136/amiajnl-2014-002974">10.1136/amiajnl-2014-002974</ext-link>
</comment>
Epub 2014 Jul 9; PubMed Central PMCID: PMC4215061. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215061/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4215061/</ext-link>
. Accessed 15 Mar 2015.
<pub-id pub-id-type="pmid">25008006</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref012">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Alsheikh-Ali</surname>
<given-names>AA</given-names>
</name>
,
<name>
<surname>Qureshi</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Al-Mallah</surname>
<given-names>MH</given-names>
</name>
,
<name>
<surname>Ioannidis</surname>
<given-names>JP</given-names>
</name>
.
<article-title>Public availability of published research data in high-impact journals</article-title>
.
<source>PLoS One</source>
.
<year>2011</year>
;
<volume>6</volume>
(
<issue>9</issue>
):
<fpage>e24357</fpage>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0024357">10.1371/journal.pone.0024357</ext-link>
</comment>
Epub 2011 Sep 7; PubMed Central PMCID: PMC3168487. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168487/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168487/</ext-link>
. Accessed 20 Feb 2015.
<pub-id pub-id-type="pmid">21915316</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref013">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Mooney</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Newton</surname>
<given-names>MP</given-names>
</name>
.
<article-title>The anatomy of a data citation: discovery, reuse, and credit</article-title>
.
<source>J Librariansh Sch Commun</source>
.
<year>2012</year>
:
<volume>1</volume>
(1):
<fpage>eP1035</fpage>
Available:
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.7710/2162-3309.1035">10.7710/2162-3309.1035</ext-link>
.</comment>
Accessed 20 Feb 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref014">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Belter</surname>
<given-names>CW</given-names>
</name>
.
<article-title>Measuring the value of research data: a citation analysis of oceanographic data sets</article-title>
.
<source>PLoS One</source>
.
<year>2014</year>
<month>3</month>
<day>26</day>
;
<volume>9</volume>
(
<issue>3</issue>
):
<fpage>e92590</fpage>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0092590">10.1371/journal.pone.0092590</ext-link>
</comment>
eCollection 2014; PubMed Central PMCID: PMC3966791. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3966791/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3966791/</ext-link>
. Accessed 20 Feb 2015.
<pub-id pub-id-type="pmid">24671177</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref015">
<label>15</label>
<mixed-citation publication-type="journal">
<name>
<surname>Piwowar</surname>
<given-names>HA</given-names>
</name>
,
<name>
<surname>Carlson</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Vision</surname>
<given-names>TJ</given-names>
</name>
.
<article-title>Beginning to track 1000 datasets from public repositories into the published literature</article-title>
.
<source>Proc Am Soc Info Sci Technol</source>
.
<year>2011</year>
[published online 2012 Jan 11; accessed 2013 May 20];
<volume>48</volume>
(
<issue>1</issue>
):
<fpage>1</fpage>
<lpage>4</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1002/meet.2011.14504801337">10.1002/meet.2011.14504801337</ext-link>
</comment>
Available:
<ext-link ext-link-type="uri" xlink:href="http://onlinelibrary.wiley.com/doi/10.1002/meet.2011.14504801337/abstract">http://onlinelibrary.wiley.com/doi/10.1002/meet.2011.14504801337/abstract</ext-link>
Poster.</mixed-citation>
</ref>
<ref id="pone.0132735.ref016">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ariño</surname>
<given-names>A.</given-names>
</name>
<article-title>Approaches to estimating the universe of natural history collections data</article-title>
.
<source>Biodivers Inf</source>
.
<year>2010</year>
;
<volume>7</volume>
(
<issue>2</issue>
):
<fpage>81</fpage>
<lpage>92</lpage>
. Available:
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.17161/bi.v7i2.3991">10.17161/bi.v7i2.3991</ext-link>
.</comment>
Accessed 20 Feb 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref017">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Ross</surname>
<given-names>JS</given-names>
</name>
,
<name>
<surname>Tse</surname>
<given-names>T</given-names>
</name>
,
<name>
<surname>Zarin</surname>
<given-names>DA</given-names>
</name>
,
<name>
<surname>Xu</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Zhou</surname>
<given-names>L</given-names>
</name>
,
<name>
<surname>Krumholz</surname>
<given-names>HM</given-names>
</name>
.
<article-title>Publication of NIH funded trials registered in ClinicalTrials.gov: cross sectional analysis</article-title>
.
<source>BMJ</source>
.
<year>2012</year>
<month>1</month>
<day>3</day>
;
<volume>344</volume>
:
<fpage>d7292</fpage>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1136/bmj.d7292">10.1136/bmj.d7292</ext-link>
</comment>
; PubMed Central PMCID: PMC3623605. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3623605/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3623605/</ext-link>
. Accessed 6 Nov 2014.
<pub-id pub-id-type="pmid">22214755</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref018">
<label>18</label>
<mixed-citation publication-type="journal">
<name>
<surname>Vines</surname>
<given-names>TH</given-names>
</name>
,
<name>
<surname>Albert</surname>
<given-names>AY</given-names>
</name>
,
<name>
<surname>Andrew</surname>
<given-names>RL</given-names>
</name>
,
<name>
<surname>Débarre</surname>
<given-names>F</given-names>
</name>
,
<name>
<surname>Bock</surname>
<given-names>DG</given-names>
</name>
,
<name>
<surname>Franklin</surname>
<given-names>MT</given-names>
</name>
,
<etal>et al</etal>
<article-title>The availability of research data declines rapidly with article age</article-title>
.
<source>Curr Biol</source>
.
<year>2014</year>
<month>1</month>
<day>6</day>
;
<volume>24</volume>
(
<issue>1</issue>
):
<fpage>94</fpage>
<lpage>7</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1016/j.cub.2013.11.014">10.1016/j.cub.2013.11.014</ext-link>
</comment>
Epub 2013 Dec 19. Accessed 20 Feb 2015.
<pub-id pub-id-type="pmid">24361065</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref019">
<label>19</label>
<mixed-citation publication-type="journal">PubMed help. Bethesda (MD): U.S. National Library of Medicine, National Center for Biotechnology Information; 2005 -. Secondary Source ID; [2 paragraphs]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Secondary_Source_ID_SI">http://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Secondary_Source_ID_SI</ext-link>
. Accessed 12 Jun 2014.</mixed-citation>
</ref>
<ref id="pone.0132735.ref020">
<label>20</label>
<mixed-citation publication-type="other">PMC help. Bethesda (MD): U.S. National Library of Medicine, National Center for Biotechnology Information; 2005 -. Acknowledgements [ACK]; [1 paragraph]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/books/NBK3825/#pmchelp.Acknowledgements_ACK">http://www.ncbi.nlm.nih.gov/books/NBK3825/#pmchelp.Acknowledgements_ACK</ext-link>
. Accessed 29 Jul 2014.</mixed-citation>
</ref>
<ref id="pone.0132735.ref021">
<label>21</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hinchliff</surname>
<given-names>CE</given-names>
</name>
,
<name>
<surname>Smith</surname>
<given-names>SA</given-names>
</name>
.
<article-title>Some limitations of public sequence data for phylogenetic inference (in plants)</article-title>
.
<source>PLoS One</source>
.
<year>2014</year>
<month>7</month>
<day>7</day>
;
<volume>9</volume>
(
<issue>7</issue>
):
<fpage>e98986</fpage>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1371/journal.pone.0098986">10.1371/journal.pone.0098986</ext-link>
</comment>
eCollection 2014. PubMed Central PMCID: PMC4085032. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4085032/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4085032/</ext-link>
. Accessed 3 Sep 2014.
<pub-id pub-id-type="pmid">24999823</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref022">
<label>22</label>
<mixed-citation publication-type="other">Trans-NIH Biomedical Informatics Coordinating Committee (BMIC). Bethesda (MD): National Institutes of Health, U.S. National Library of Medicine; 2013 Jan 4. NIH data sharing repositories; 2013 Jan 23. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html">http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html</ext-link>
. Accessed 2 Aug 2013.</mixed-citation>
</ref>
<ref id="pone.0132735.ref023">
<label>23</label>
<mixed-citation publication-type="other">National Library of Medicine. Bethesda (MD): National Institutes of Health (US), National Library of Medicine; 1993. MEDLINE PubMed XML element descriptions and their attributes; 2005 Dec [last modified 2012 Dec; accessed 2013 Aug 4]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html">http://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html</ext-link>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref024">
<label>24</label>
<mixed-citation publication-type="other">PMC. Bethesda (MD): U.S. National Library of Medicine, National Center for Biotechnology Information; 2000. PMC open access subset; [2013; updated 2014 Jan 13; accessed 2014 Dec 10]. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/">http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/</ext-link>
.</mixed-citation>
</ref>
<ref id="pone.0132735.ref025">
<label>25</label>
<mixed-citation publication-type="journal">
<name>
<surname>Robinson-Garcia</surname>
<given-names>N</given-names>
</name>
,
<name>
<surname>Jimenez-Contreras</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Torres-Salinas</surname>
<given-names>D</given-names>
</name>
.
<article-title>Analyzing data citation practices using the Data Citation Index</article-title>
.
<source>J Assoc Inf Sci Technol</source>
.
<year>2015</year>
<month>6</month>
<day>1</day>
:[
<fpage>12</fpage>
p.]. Available:
<ext-link ext-link-type="uri" xlink:href="http://onlinelibrary.wiley.com/doi/10.1002/asi.23529/abstract">http://onlinelibrary.wiley.com/doi/10.1002/asi.23529/abstract</ext-link>
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1002/asi.23529">10.1002/asi.23529</ext-link>
.</comment>
Also available from:
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1501.06285">http://arxiv.org/abs/1501.06285</ext-link>
. Accessed 9 Jun 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref026">
<label>26</label>
<mixed-citation publication-type="journal">
<name>
<surname>Parson</surname>
<given-names>MA</given-names>
</name>
,
<name>
<surname>Duerr</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Minster</surname>
<given-names>JB</given-names>
</name>
.
<article-title>Data citation and peer review</article-title>
.
<source>EOS</source>
.
<year>2010</year>
<month>8</month>
<day>24</day>
;
<volume>91</volume>
(
<issue>34</issue>
):
<fpage>297</fpage>
<lpage>8</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1029/2010EO340001">10.1029/2010EO340001</ext-link>
</comment>
Available:
<ext-link ext-link-type="uri" xlink:href="http://onlinelibrary.wiley.com/doi/10.1029/2010EO340001/full">http://onlinelibrary.wiley.com/doi/10.1029/2010EO340001/full</ext-link>
. Accessed 20 Feb 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref027">
<label>27</label>
<mixed-citation publication-type="journal">
<name>
<surname>Callaghan</surname>
<given-names>S</given-names>
</name>
.
<article-title>Preserving the integrity of the scientific record: data citation and linking</article-title>
.
<source>Learn Publ</source>
.
<year>2014</year>
;
<volume>27</volume>
:
<fpage>S15</fpage>
<lpage>S24</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1087/20140504">10.1087/20140504</ext-link>
</comment>
Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ingentaconnect.com/content/alpsp/lp/2014/00000027/00000005/art00004">http://www.ingentaconnect.com/content/alpsp/lp/2014/00000027/00000005/art00004</ext-link>
. Accessed 18 Feb 2015.</mixed-citation>
</ref>
<ref id="pone.0132735.ref028">
<label>28</label>
<mixed-citation publication-type="journal">
<name>
<surname>Lynch</surname>
<given-names>C</given-names>
</name>
.
<article-title>The shape of the scientific article in the developing cyberinfrastructure</article-title>
.
<source>CTWatch Q</source>
.
<year>2007</year>
<month>8</month>
;
<volume>3</volume>
(
<issue>3</issue>
):
<fpage>5</fpage>
<lpage>10</lpage>
. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ctwatch.org/quarterly/articles/2007/08/the-shape-of-the-scientific-article-in-the-developing-cyberinfrastructure/">http://www.ctwatch.org/quarterly/articles/2007/08/the-shape-of-the-scientific-article-in-the-developing-cyberinfrastructure/</ext-link>
. Accessed 12 Feb 2013.</mixed-citation>
</ref>
<ref id="pone.0132735.ref029">
<label>29</label>
<mixed-citation publication-type="journal">
<name>
<surname>Lindberg</surname>
<given-names>DA</given-names>
</name>
.
<article-title>Research opportunities and challenges in 2005</article-title>
.
<source>Methods Inf Med</source>
.
<year>2005</year>
;
<volume>44</volume>
(
<issue>4</issue>
):
<fpage>483</fpage>
<lpage>6</lpage>
. Available:
<ext-link ext-link-type="uri" xlink:href="http://methods.schattauer.de/en/contents/archivestandard/issue/685/manuscript/504/show.html">http://methods.schattauer.de/en/contents/archivestandard/issue/685/manuscript/504/show.html</ext-link>
. Accessed 28 Apr 2015.
<pub-id pub-id-type="pmid">16342914</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref030">
<label>30</label>
<mixed-citation publication-type="journal">
<name>
<surname>Thoma</surname>
<given-names>GR</given-names>
</name>
,
<name>
<surname>Ford</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Antani</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Demner-Fushman</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Chung</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Simpson</surname>
<given-names>M</given-names>
</name>
.
<article-title>Interactive publication: the document as a research tool</article-title>
.
<source>Web Semant</source>
.
<year>2010</year>
<month>7</month>
<day>1</day>
;
<volume>8</volume>
(
<issue>2–3</issue>
):
<fpage>145</fpage>
<lpage>150</lpage>
. PubMed Central PMCID: PMC2908409. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908409/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2908409/</ext-link>
. Accessed 2 Mar 2015.
<pub-id pub-id-type="pmid">20657757</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref031">
<label>31</label>
<mixed-citation publication-type="journal">
<name>
<surname>Mons</surname>
<given-names>B</given-names>
</name>
,
<name>
<surname>van Haagen</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Chichester</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Hoen</surname>
<given-names>PB</given-names>
</name>
,
<name>
<surname>den Dunnen</surname>
<given-names>JT</given-names>
</name>
,
<name>
<surname>van Ommen</surname>
<given-names>G</given-names>
</name>
,
<etal>et al</etal>.
<article-title>The value of data</article-title>
.
<source>Nat Genet</source>
.
<year>2011</year>
<month>3</month>
<day>29</day>
[accessed 2015 Feb 20];
<volume>43</volume>
(
<issue>4</issue>
):
<fpage>281</fpage>
<lpage>3</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1038/ng0411-281">10.1038/ng0411-281</ext-link>
</comment>
.
<pub-id pub-id-type="pmid">21445068</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref032">
<label>32</label>
<mixed-citation publication-type="other">DataCite. London: DataCite; [accessed 2014 Aug 11]. DataCite Metadata Schema Repository; [last updated 2013 Jul 24; accessed 2014 Aug 11]. Available:
<ext-link ext-link-type="uri" xlink:href="http://schema.datacite.org/">http://schema.datacite.org/</ext-link>
.</mixed-citation>
</ref>
<ref id="pone.0132735.ref033">
<label>33</label>
<mixed-citation publication-type="other">Dryad Digital Repository. Durham (NC): Dryad. 2008 Jan -. Metadata profile: Dryad metadata application profile (schema); [last modified 2013 Feb 27; accessed 2014 Aug 3]. Available:
<ext-link ext-link-type="uri" xlink:href="http://wiki.datadryad.org/Metadata_Profile">http://wiki.datadryad.org/Metadata_Profile</ext-link>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref034">
<label>34</label>
<mixed-citation publication-type="other">W3C. [place unknown]: World Wide Web Consortium; c2014. Data Catalogue Vocabulary (DCAT); 2014 Jan 16. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.w3.org/TR/vocab-dcat/">http://www.w3.org/TR/vocab-dcat/</ext-link>
. W3C recommendation. Accessed 7 Feb 2014.</mixed-citation>
</ref>
<ref id="pone.0132735.ref035">
<label>35</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chavan</surname>
<given-names>V</given-names>
</name>
,
<name>
<surname>Penev</surname>
<given-names>L</given-names>
</name>
.
<article-title>The data paper: a mechanism to incentivize data publishing in biodiversity science</article-title>
.
<source>BMC Bioinformatics</source>
.
<year>2011</year>
;
<volume>12</volume>
<issue>Suppl 15</issue>
:
<fpage>S2</fpage>
; PubMed Central PMCID: PMC3287445. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287445/">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3287445/</ext-link>
. Accessed 20 Feb 2015.
<pub-id pub-id-type="pmid">22373175</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref036">
<label>36</label>
<mixed-citation publication-type="journal">
<name>
<surname>Costello</surname>
<given-names>MJ</given-names>
</name>
,
<name>
<surname>Michener</surname>
<given-names>WK</given-names>
</name>
,
<name>
<surname>Gahegan</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>ZQ</given-names>
</name>
,
<name>
<surname>Bourne</surname>
<given-names>PE</given-names>
</name>
.
<article-title>Biodiversity data should be published, cited, and peer reviewed</article-title>
.
<source>Trends Ecol Evol</source>
.
<year>2013</year>
<month>8</month>
[accessed 2015 Feb 20];
<volume>28</volume>
(
<issue>8</issue>
):
<fpage>454</fpage>
<lpage>61</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.1016/j.tree.2013.05.002">10.1016/j.tree.2013.05.002</ext-link>
</comment>
Epub 2013 Jun 5.
<pub-id pub-id-type="pmid">23756105</pub-id>
</mixed-citation>
</ref>
<ref id="pone.0132735.ref037">
<label>37</label>
<mixed-citation publication-type="journal">
<name>
<surname>Rousidis</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Garoufallou</surname>
<given-names>E</given-names>
</name>
,
<name>
<surname>Balatsoukas</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Sicilia</surname>
<given-names>MA</given-names>
</name>
.
<article-title>Metadata for Big Data: a preliminary investigation of metadata quality issues in research data repositories</article-title>
.
<source>Inf Serv Use</source>
.
<year>2014</year>
;
<volume>34</volume>
(
<issue>3–4</issue>
):
<fpage>279</fpage>
<lpage>86</lpage>
.
<comment>doi:
<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.3233/ISU-140746">10.3233/ISU-140746</ext-link>
.</comment>
Accessed 12 Dec 2014.</mixed-citation>
</ref>
<ref id="pone.0132735.ref038">
<label>38</label>
<mixed-citation publication-type="other">Big data to knowledge (BD2K). Bethesda (MD): U.S. Department of Health and Human Services, National Institutes of Health (US); 2012 [last updated 2015 Jun 1]. Data Discovery Index Coordination Consortium (DDICC) (University of California, San Diego). BioCADDIE: Biomedical and healthcare data discovery and indexing engine center; [about 1 p.]. Available:
<ext-link ext-link-type="uri" xlink:href="https://datascience.nih.gov/sites/default/files/bd2k/docs/DDIC.pdf">https://datascience.nih.gov/sites/default/files/bd2k/docs/DDIC.pdf</ext-link>
. Accessed 9 Jun 2015.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000076  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000076  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024