A cross disciplinary study of link decay and the effectiveness of mitigation techniques

Internal identifier: 000176 (Ncbi/Merge); previous: 000175; next: 000177

Authors: Jason Hennessey [United States]; Steven Xijin Ge [United States]

Source :

BMC Bioinformatics [ 1471-2105 ] ; 2013.

RBID : PMC:3851533

Abstract

Background

The dynamic, decentralized world-wide-web has become an essential part of scientific research and communication. Researchers create thousands of web sites every year to share software, data and services. These valuable resources tend to disappear over time. The problem has been documented in many subject areas. Our goal is to conduct a cross-disciplinary investigation of the problem and test the effectiveness of existing remedies.

Results

We accessed 14,489 unique web pages found in the abstracts within Thomson Reuters' Web of Science citation index that were published between 1996 and 2010 and found that the median lifespan of these web pages was 9.3 years with 62% of them being archived. Survival analysis and logistic regression were used to find significant predictors of URL lifespan. The availability of a web page is most dependent on the time it is published and the top-level domain names. Similar statistical analysis revealed biases in current solutions: the Internet Archive favors web pages with fewer layers in the Universal Resource Locator (URL) while WebCite is significantly influenced by the source of publication. We also created a prototype for a process to submit web pages to the archives and increased coverage of our list of scientific webpages in the Internet Archive and WebCite by 22% and 255%, respectively.

Conclusion

Our results show that link decay continues to be a problem across different disciplines and that current solutions for static web pages are helping and can be improved.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851533

DOI: 10.1186/1471-2105-14-S14-S5
PubMed: 24266891
PubMed Central: 3851533

Links toward previous steps (curation, corpus...)

to stream Pmc, to step Corpus: 000071
to stream Pmc, to step Curation: 000071
to stream Pmc, to step Checkpoint: 000082

Links to Exploration step

PMC:3851533

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A cross disciplinary study of link decay and the effectiveness of mitigation techniques</title>
<author><name sortKey="Hennessey, Jason" sort="Hennessey, Jason" uniqKey="Hennessey J" first="Jason" last="Hennessey">Jason Hennessey</name>
<affiliation wicri:level="2"><nlm:aff id="I1">Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007</wicri:regionArea>
<placeName><region type="state">Dakota du Sud</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Ge, Steven Xijin" sort="Ge, Steven Xijin" uniqKey="Ge S" first="Steven Xijin" last="Ge">Steven Xijin Ge</name>
<affiliation wicri:level="2"><nlm:aff id="I1">Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007</wicri:regionArea>
<placeName><region type="state">Dakota du Sud</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">24266891</idno>
<idno type="pmc">3851533</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851533</idno>
<idno type="RBID">PMC:3851533</idno>
<idno type="doi">10.1186/1471-2105-14-S14-S5</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000071</idno>
<idno type="wicri:Area/Pmc/Curation">000071</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000082</idno>
<idno type="wicri:Area/Ncbi/Merge">000176</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">A cross disciplinary study of link decay and the effectiveness of mitigation techniques</title>
<author><name sortKey="Hennessey, Jason" sort="Hennessey, Jason" uniqKey="Hennessey J" first="Jason" last="Hennessey">Jason Hennessey</name>
<affiliation wicri:level="2"><nlm:aff id="I1">Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007</wicri:regionArea>
<placeName><region type="state">Dakota du Sud</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Ge, Steven Xijin" sort="Ge, Steven Xijin" uniqKey="Ge S" first="Steven Xijin" last="Ge">Steven Xijin Ge</name>
<affiliation wicri:level="2"><nlm:aff id="I1">Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007</wicri:regionArea>
<placeName><region type="state">Dakota du Sud</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>The dynamic, decentralized world-wide-web has become an essential part of scientific research and communication. Researchers create thousands of web sites every year to share software, data and services. These valuable resources tend to disappear over time. The problem has been documented in many subject areas. Our goal is to conduct a cross-disciplinary investigation of the problem and test the effectiveness of existing remedies.</p>
</sec>
<sec><title>Results</title>
<p>We accessed 14,489 unique web pages found in the abstracts within Thomson Reuters' Web of Science citation index that were published between 1996 and 2010 and found that the median lifespan of these web pages was 9.3 years with 62% of them being archived. Survival analysis and logistic regression were used to find significant predictors of URL lifespan. The availability of a web page is most dependent on the time it is published and the top-level domain names. Similar statistical analysis revealed biases in current solutions: the Internet Archive favors web pages with fewer layers in the Universal Resource Locator (URL) while WebCite is significantly influenced by the source of publication. We also created a prototype for a process to submit web pages to the archives and increased coverage of our list of scientific webpages in the Internet Archive and WebCite by 22% and 255%, respectively.</p>
</sec>
<sec><title>Conclusion</title>
<p>Our results show that link decay continues to be a problem across different disciplines and that current solutions for static web pages are helping and can be improved.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Ducut, E" uniqKey="Ducut E">E Ducut</name>
</author>
<author><name sortKey="Liu, F" uniqKey="Liu F">F Liu</name>
</author>
<author><name sortKey="Fontelo, P" uniqKey="Fontelo P">P Fontelo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Aronsky, D" uniqKey="Aronsky D">D Aronsky</name>
</author>
<author><name sortKey="Madani, S" uniqKey="Madani S">S Madani</name>
</author>
<author><name sortKey="Carnevale, Rj" uniqKey="Carnevale R">RJ Carnevale</name>
</author>
<author><name sortKey="Duda, S" uniqKey="Duda S">S Duda</name>
</author>
<author><name sortKey="Feyder, Mt" uniqKey="Feyder M">MT Feyder</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wren, Jd" uniqKey="Wren J">JD Wren</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wren, Jd" uniqKey="Wren J">JD Wren</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yang, Sl" uniqKey="Yang S">SL Yang</name>
</author>
<author><name sortKey="Qiu, Jp" uniqKey="Qiu J">JP Qiu</name>
</author>
<author><name sortKey="Xiong, Zy" uniqKey="Xiong Z">ZY Xiong</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Eysenbach, G" uniqKey="Eysenbach G">G Eysenbach</name>
</author>
<author><name sortKey="Trudell, M" uniqKey="Trudell M">M Trudell</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Wren, Jd" uniqKey="Wren J">JD Wren</name>
</author>
<author><name sortKey="Johnson, Kr" uniqKey="Johnson K">KR Johnson</name>
</author>
<author><name sortKey="Crockett, Dm" uniqKey="Crockett D">DM Crockett</name>
</author>
<author><name sortKey="Heilig, Lf" uniqKey="Heilig L">LF Heilig</name>
</author>
<author><name sortKey="Schilling, Lm" uniqKey="Schilling L">LM Schilling</name>
</author>
<author><name sortKey="Dellavalle, Rp" uniqKey="Dellavalle R">RP Dellavalle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Casserly, Mf" uniqKey="Casserly M">MF Casserly</name>
</author>
<author><name sortKey="Bird, Je" uniqKey="Bird J">JE Bird</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Wagner, C" uniqKey="Wagner C">C Wagner</name>
</author>
<author><name sortKey="Gebremichael, Md" uniqKey="Gebremichael M">MD Gebremichael</name>
</author>
<author><name sortKey="Taylor, Mk" uniqKey="Taylor M">MK Taylor</name>
</author>
<author><name sortKey="Soltys, Mj" uniqKey="Soltys M">MJ Soltys</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Koehler, W" uniqKey="Koehler W">W Koehler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bar Ilan, J" uniqKey="Bar Ilan J">J Bar-Ilan</name>
</author>
<author><name sortKey="Peritz, Bc" uniqKey="Peritz B">BC Peritz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Koehler, W" uniqKey="Koehler W">W Koehler</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Casserly, Mf" uniqKey="Casserly M">MF Casserly</name>
</author>
<author><name sortKey="Bird, Je" uniqKey="Bird J">JE Bird</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Peng, Rd" uniqKey="Peng R">RD Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ince, Dc" uniqKey="Ince D">DC Ince</name>
</author>
<author><name sortKey="Hatton, L" uniqKey="Hatton L">L Hatton</name>
</author>
<author><name sortKey="Graham Cumming, J" uniqKey="Graham Cumming J">J Graham-Cumming</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Therneau, T" uniqKey="Therneau T">T Therneau</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Markwell, J" uniqKey="Markwell J">J Markwell</name>
</author>
<author><name sortKey="Brooks, Dw" uniqKey="Brooks D">DW Brooks</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Thorp, Aw" uniqKey="Thorp A">AW Thorp</name>
</author>
<author><name sortKey="Brown, L" uniqKey="Brown L">L Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Carnevale, Rj" uniqKey="Carnevale R">RJ Carnevale</name>
</author>
<author><name sortKey="Aronsky, D" uniqKey="Aronsky D">D Aronsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dimitrova, Dv" uniqKey="Dimitrova D">DV Dimitrova</name>
</author>
<author><name sortKey="Bugeja, M" uniqKey="Bugeja M">M Bugeja</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Duda, Jj" uniqKey="Duda J">JJ Duda</name>
</author>
<author><name sortKey="Camp, Rj" uniqKey="Camp R">RJ Camp</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rhodes, S" uniqKey="Rhodes S">S Rhodes</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Goh, Dhl" uniqKey="Goh D">DHL Goh</name>
</author>
<author><name sortKey="Ng, Pk" uniqKey="Ng P">PK Ng</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Russell, E" uniqKey="Russell E">E Russell</name>
</author>
<author><name sortKey="Kane, J" uniqKey="Kane J">J Kane</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dellavalle, Rp" uniqKey="Dellavalle R">RP Dellavalle</name>
</author>
<author><name sortKey="Hester, Ej" uniqKey="Hester E">EJ Hester</name>
</author>
<author><name sortKey="Heilig, Lf" uniqKey="Heilig L">LF Heilig</name>
</author>
<author><name sortKey="Drake, Al" uniqKey="Drake A">AL Drake</name>
</author>
<author><name sortKey="Kuntzman, Jw" uniqKey="Kuntzman J">JW Kuntzman</name>
</author>
<author><name sortKey="Graber, M" uniqKey="Graber M">M Graber</name>
</author>
<author><name sortKey="Schilling, Lm" uniqKey="Schilling L">LM Schilling</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Evangelou, E" uniqKey="Evangelou E">E Evangelou</name>
</author>
<author><name sortKey="Trikalinos, Ta" uniqKey="Trikalinos T">TA Trikalinos</name>
</author>
<author><name sortKey="Ioannidis, Jpa" uniqKey="Ioannidis J">JPA Ioannidis</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sellitto, C" uniqKey="Sellitto C">C Sellitto</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bar Ilan, J" uniqKey="Bar Ilan J">J Bar-Ilan</name>
</author>
<author><name sortKey="Peritz, B" uniqKey="Peritz B">B Peritz</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gomes, D" uniqKey="Gomes D">D Gomes</name>
</author>
<author><name sortKey="Silva, Mj" uniqKey="Silva M">MJ Silva</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Markwell, J" uniqKey="Markwell J">J Markwell</name>
</author>
<author><name sortKey="Brooks, Dw" uniqKey="Brooks D">DW Brooks</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Wu, Zq" uniqKey="Wu Z">ZQ Wu</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="abstract" xml:lang="en"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group><journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher><publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">24266891</article-id>
<article-id pub-id-type="pmc">3851533</article-id>
<article-id pub-id-type="publisher-id">1471-2105-14-S14-S5</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-14-S14-S5</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Proceedings</subject>
</subj-group>
</article-categories>
<title-group><article-title>A cross disciplinary study of link decay and the effectiveness of mitigation techniques</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" id="A1"><name><surname>Hennessey</surname>
<given-names>Jason</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>Jason.Hennessey@jacks.sdstate.edu</email>
</contrib>
<contrib contrib-type="author" corresp="yes" id="A2"><name><surname>Ge</surname>
<given-names>Steven Xijin</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>Xijin.Ge@sdstate.edu</email>
</contrib>
</contrib-group>
<aff id="I1"><label>1</label>
Department of Mathematics and Statistics, South Dakota State University, Box 2220, Brookings, SD 57007, USA</aff>
<pub-date pub-type="collection"><year>2013</year>
</pub-date>
<pub-date pub-type="epub"><day>9</day>
<month>10</month>
<year>2013</year>
</pub-date>
<volume>14</volume>
<issue>Suppl 14</issue>
<supplement><named-content content-type="supplement-title">Proceedings of the Tenth Annual MCBIOS Conference</named-content>
<named-content content-type="supplement-editor">Jonathan D Wren (Senior Editor), Mikhail G Dozmorov, Dennis Burian, Rakesh Kaundal, Andy Perkins, Ed Perkins, Doris M Kupfer and Gordon K Springer</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. Articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</named-content>
</supplement>
<fpage>S5</fpage>
<lpage>S5</lpage>
<permissions><copyright-statement>Copyright © 2013 Hennessey and Ge; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2013</copyright-year>
<copyright-holder>Hennessey and Ge; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/2.0"><license-p>This is an open access article distributed under the terms of the Creative Commons Attribution License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/2.0">http://creativecommons.org/licenses/by/2.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/14/S14/S5"></self-uri>
<abstract><sec><title>Background</title>
<p>The dynamic, decentralized world-wide-web has become an essential part of scientific research and communication. Researchers create thousands of web sites every year to share software, data and services. These valuable resources tend to disappear over time. The problem has been documented in many subject areas. Our goal is to conduct a cross-disciplinary investigation of the problem and test the effectiveness of existing remedies.</p>
</sec>
<sec><title>Results</title>
<p>We accessed 14,489 unique web pages found in the abstracts within Thomson Reuters' Web of Science citation index that were published between 1996 and 2010 and found that the median lifespan of these web pages was 9.3 years with 62% of them being archived. Survival analysis and logistic regression were used to find significant predictors of URL lifespan. The availability of a web page is most dependent on the time it is published and the top-level domain names. Similar statistical analysis revealed biases in current solutions: the Internet Archive favors web pages with fewer layers in the Universal Resource Locator (URL) while WebCite is significantly influenced by the source of publication. We also created a prototype for a process to submit web pages to the archives and increased coverage of our list of scientific webpages in the Internet Archive and WebCite by 22% and 255%, respectively.</p>
</sec>
<sec><title>Conclusion</title>
<p>Our results show that link decay continues to be a problem across different disciplines and that current solutions for static web pages are helping and can be improved.</p>
</sec>
</abstract>
<conference><conf-date>5-6 April 2013</conf-date>
<conf-name>Tenth Annual MCBIOS Conference. Discovery in a sea of data</conf-name>
<conf-loc>Columbia, MO, USA</conf-loc>
</conference>
</article-meta>
</front>
<body><sec><title>Background</title>
<p>Scholarly Internet resources play an increasingly important role in modern research. We can see this by the increasing number of URLs published in a paper's title or abstract [<xref ref-type="bibr" rid="B1">1</xref>
](also see Figure <xref ref-type="fig" rid="F1">1</xref>
). Until now, maintaining the availability of scientific contributions has been decentralized, mature and effective, relying on methods developed over centuries to archive the books and journals in which they were communicated. Because the Internet is still a relatively new medium for communicating scientific thought, the community is still working out how best to use it in a way that preserves contributions for years to come. One problem is that the continued availability of these online resources is at the mercy of the organizations or individuals that host them. Many disappear after publication (and some even disappear before [<xref ref-type="bibr" rid="B2">2</xref>
]), leading to a well-documented phenomenon referred to as link rot or link decay.</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p><bold>Growth of scholarly online resources</bold>
. Both the number of URL-containing articles (those with "http" in the title or abstract) published per year (dotted line) and the percentage of published items containing URLs (solid line) are increasing. According to a linear fit, the number of such articles grew by 174 per year (R<sup>2</sup> = 0.97), and the percentage grew by 0.010% per year (R<sup>2</sup> = 0.98). Source: Thomson Reuters' Web of Science</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S5-1"></graphic>
</fig>
<p>The problem has been documented in several subject areas, with Table <xref ref-type="table" rid="T1">1</xref>
 containing a large list of these subject-specific studies. In terms of wide, cross-disciplinary analyses, the closest thus far are those of the biological and medical MEDLINE and PubMed databases by Ducut [<xref ref-type="bibr" rid="B1">1</xref>
] and Wren [<xref ref-type="bibr" rid="B3">3</xref>
,<xref ref-type="bibr" rid="B4">4</xref>
], in addition to Yang's study of the Social Sciences within the Chinese Social Sciences Citation Index (CSSCI) [<xref ref-type="bibr" rid="B5">5</xref>
].</p>
<table-wrap id="T1" position="float"><label>Table 1</label>
<caption><p>Link decay has been studied for several years in specific subject areas.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Field</th>
<th align="center">Links Source/Type</th>
<th align="center">Year(s) of URLs</th>
<th align="center">N</th>
<th align="center">Reference</th>
</tr>
</thead>
<tbody><tr><td align="center">Biology &amp; Medicine</td>
<td align="center">Science curriculum web links</td>
<td align="center">2000</td>
<td align="center">515</td>
<td align="center">[<xref ref-type="bibr" rid="B24">24</xref>
]</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">Full text of 3 dermatology journals</td>
<td align="center">1999-2004</td>
<td align="center">1113</td>
<td align="center">[<xref ref-type="bibr" rid="B11">11</xref>
]</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">Sample of bibliographies being published on PubMed</td>
<td align="center">2006</td>
<td align="center">840</td>
<td align="center">[<xref ref-type="bibr" rid="B2">2</xref>
]</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">References made in the <italic>Annals of Emergency Medicine</italic>
</td>
<td align="center">2000, 2003, 2005</td>
<td align="center">586</td>
<td align="center">[<xref ref-type="bibr" rid="B25">25</xref>
]</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">References in 5 biomedical informatics journals.</td>
<td align="center">1999-2004</td>
<td align="center">1049</td>
<td align="center">[<xref ref-type="bibr" rid="B26">26</xref>
]</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">MEDLINE titles &amp; abstracts</td>
<td align="center">1994-2006</td>
<td align="center">10208</td>
<td align="center">[<xref ref-type="bibr" rid="B1">1</xref>
]*</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">Internet citations in 5 health care management journals from 2002-2004</td>
<td align="center">2009-2010</td>
<td align="center">2011</td>
<td align="center">[<xref ref-type="bibr" rid="B14">14</xref>
]</td>
</tr>
<tr><td></td>
<td colspan="4"><hr></hr>
</td>
</tr>
<tr><td></td>
<td align="center">MEDLINE abstracts</td>
<td align="center">1995-2007</td>
<td align="center">7462</td>
<td align="center">[<xref ref-type="bibr" rid="B3">3</xref>
]*</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Communications</td>
<td align="center">Citations appearing in research articles in 6 leading communications journals</td>
<td align="center">2000-2003</td>
<td align="center">1600</td>
<td align="center">[<xref ref-type="bibr" rid="B27">27</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Ecology</td>
<td align="center">URLs appearing in the full text of 4 Ecological Society of America journals</td>
<td align="center">1997-2005</td>
<td align="center">2100</td>
<td align="center">[<xref ref-type="bibr" rid="B28">28</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Law</td>
<td align="center">Samples from a collection of born-digital law- and policy-related reports and documents</td>
<td align="center">2007-2010</td>
<td align="center">2372</td>
<td align="center">[<xref ref-type="bibr" rid="B29">29</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Library/Information Science</td>
<td align="center">Citations appearing in 3 leading Information Science journals</td>
<td align="center">1997-2003</td>
<td align="center">2516</td>
<td align="center">[<xref ref-type="bibr" rid="B30">30</xref>
]</td>
</tr>
<tr><td></td>
<td align="center">Sample of citations appearing in library and information science journals</td>
<td align="center">1999-2000</td>
<td align="center">500</td>
<td align="center">[<xref ref-type="bibr" rid="B18">18</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Social Sciences</td>
<td align="center">URLs appearing in the full text of 2 well-respected historical journals</td>
<td align="center">1999-2006</td>
<td align="center">510</td>
<td align="center">[<xref ref-type="bibr" rid="B31">31</xref>
]</td>
</tr>
<tr><td></td>
<td align="center">Citations from articles in the Chinese Social Sciences Index</td>
<td align="center">1998-2007</td>
<td align="center">44973</td>
<td align="center">[<xref ref-type="bibr" rid="B5">5</xref>
]*</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Various</td>
<td align="center">Random Collection of web URLs</td>
<td align="center">1996</td>
<td align="center">371</td>
<td align="center">[<xref ref-type="bibr" rid="B15">15</xref>
,<xref ref-type="bibr" rid="B17">17</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Various</td>
<td align="center">Citations in 3 highly circulated journals</td>
<td align="center">2002-2003</td>
<td align="center">672</td>
<td align="center">[<xref ref-type="bibr" rid="B32">32</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Various</td>
<td align="center">Supplementary information published in 6 top-cited journals</td>
<td align="center">2000, 2003</td>
<td align="center">585</td>
<td align="center">[<xref ref-type="bibr" rid="B33">33</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Various</td>
<td align="center">Citations from conference articles</td>
<td align="center">1995-2003</td>
<td align="center">1068</td>
<td align="center">[<xref ref-type="bibr" rid="B34">34</xref>
]</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Various Collections</td>
<td></td>
<td></td>
<td></td>
<td align="center">[<xref ref-type="bibr" rid="B35">35</xref>
-<xref ref-type="bibr" rid="B38">38</xref>
]</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>* denotes the studies most similar to the current study.</p>
</table-wrap-foot>
</table-wrap>
<p>Some solutions have been proposed which attack the problem from different angles. The Internet Archive (IA) [<xref ref-type="bibr" rid="B6">6</xref>
] and WebCite (WC) [<xref ref-type="bibr" rid="B7">7</xref>
] address the issue by archiving web pages, though their mechanisms for acquiring those pages differ. The IA, which began as a partnership with the Alexa search engine, employs an algorithm that crawls the Internet at large, storing snapshots of the pages it encounters. In contrast, WebCite archives only those pages that are submitted to it and is geared toward the scientific community. Both methods, however, can capture only information that is visible from the client; logic and data housed on the server are frequently unavailable.</p>
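<p>As an illustrative sketch only (it is not the procedure used in this study), checking whether a page has a crawled snapshot could be done against the Internet Archive's public availability endpoint. The endpoint address and JSON layout below follow that service's public documentation as commonly described and should be verified before use; the function names <italic>build_query</italic> and <italic>closest_snapshot</italic> are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability service.
WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the request URL asking for the snapshot closest to `timestamp`.

    `timestamp` is an optional string such as "20100101" (YYYYMMDD).
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the archived snapshot URL from a JSON response, or None.

    The service is assumed to answer with
    {"archived_snapshots": {"closest": {"available": true, "url": ...}}}
    and with an empty "archived_snapshots" object when nothing is archived.
    """
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```
]]></preformat>
<p>The sketch deliberately performs no network request: the string returned by <italic>build_query</italic> would be fetched with any HTTP client, and the response body passed to <italic>closest_snapshot</italic>. Keeping the parsing separate makes it possible to test against stored responses.</p>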
<p>Other tools, like the Digital Object Identifier (DOI) System [<xref ref-type="bibr" rid="B8">8</xref>
] and Persistent Uniform Resource Locator (PURL) [<xref ref-type="bibr" rid="B9">9</xref>
], provide solutions for when a web resource is moved to a different URL but is still available. The DOI System was created by an international consortium of organizations wishing to assign unique identifiers to items such as movies, television shows, books, journal articles, web sites and data sets. It encompasses several thousand "Naming Authorities" organized under a few "Registration Agencies" that have a lot of flexibility in their business models[<xref ref-type="bibr" rid="B10">10</xref>
]. Perhaps 30-60% of link rot could be solved using DOIs and PURLs[<xref ref-type="bibr" rid="B11">11</xref>
,<xref ref-type="bibr" rid="B12">12</xref>
]. However, they are not without pitfalls. One is that a researcher or company could lose interest in a particular tool and thus stop updating its permanent identifier. Another is that the party wanting the permanent URL (the publishing author) is frequently not the person administering the site over the long term, which creates an imbalance between desire and responsibility. A third, in the case of the DOI System, is that registering an organization may cost money and time, which could be prohibitive to authors who do not already have access to a Naming Authority [<xref ref-type="bibr" rid="B1">1</xref>
]. One example of a DOI System business model would be that of the California Digital Library's EZID service, which charges a flat rate (currently $2,500 for a research institution) for up to 1 million DOIs per year[<xref ref-type="bibr" rid="B13">13</xref>
].</p>
<p>In this study, we ask two questions: what are the problem's characteristics in the scientific literature as a whole, and how is it being addressed? To assess progress in combating the problem, we evaluate the effectiveness of the two most prevalent preservation engines and examine the effectiveness of one prototyped solution. If a URL is published in the abstract, we assume that it plays a prominent role within that paper, similar to the rationale proposed by Wren [<xref ref-type="bibr" rid="B4">4</xref>
].</p>
</sec>
<sec sec-type="results"><title>Results</title>
<p>Our goals are to provide some metrics that are useful in understanding the problem of link decay in a cross-disciplinary fashion and to examine the effectiveness of the existing archival methods while proposing some incremental improvements. To accomplish these tasks, we downloaded 18,231 Web of Science (WOS) abstracts containing "http" in the title or abstract from the years under study (1996-2010), out of which 17,110 URLs (14,489 unique) were extracted and used. We developed Python scripts to access these URLs over a 30-day period. For the period studied, 69% of the published URLs (67% of the unique) were available on the live Internet, the Internet Archive's Wayback Machine had archived 62% (59% unique) of the total and WebCite had 21% (16% unique). Overall, 65% of all URLs (62% unique) were available from one of the two surveyed archival engines. Figure <xref ref-type="fig" rid="F2">2</xref>
 contains a breakdown by year for availability on the live web as well as through the combined archives, and Figure <xref ref-type="fig" rid="F3">3</xref>
 illustrates each archival engine's coverage. The median lifetime for published URLs was found to be 9.3 years (95% CI [9.3,10.0]), with the median lifetime amongst unique URLs also being 9.3 years (95% CI [9.3,9.3]). Subject-specific lifetimes may be found in Table <xref ref-type="table" rid="T2">2</xref>
. Using a simple linear model, the chance that a URL published in a particular year is still available goes down by 3.7% for each year added to its age, with an R<sup>2 </sup>
of 0.96. Its chances of being archived go up after an initial period of flux (see Figure <xref ref-type="fig" rid="F2">2</xref>
). Submitting our list of unarchived but living URLs to the archival engines showed dramatic promise, increasing the Internet Archive's coverage of the dataset by 2,080 URLs, an increase of 22%, and WebCite's by 6,348, an increase of 255%.</p>
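<p>The linear model mentioned above can be illustrated with an ordinary least-squares fit of the fraction of live URLs against age. This is only a minimal sketch, not the authors' script: the function name <monospace>linear_fit</monospace> and the sample data points are hypothetical and were chosen purely to show the calculation.</p>

```python
from statistics import mean


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x; also returns R^2."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical (age in years, fraction alive) pairs, NOT the study's data.
ages = [1, 2, 3, 4, 5]
alive = [0.95, 0.91, 0.88, 0.84, 0.80]
slope, intercept, r2 = linear_fit(ages, alive)
print(f"slope per year: {slope:.3f}, R^2: {r2:.3f}")
```

<p>A negative slope corresponds to the yearly loss in availability, and R<sup>2</sup> indicates how well a straight line describes the per-year availability values.</p>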
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p><bold>The accessibility of URLs from a particular year is closely correlated with age</bold>
. The probability of being available (solid line) declines by 3.7% every year based on a linear model with R<sup>2 </sup>
of 0.96. The surveyed archival engines have about a 70-80% archival rate (dotted line) following an initial ramp-up time.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S5-2"></graphic>
</fig>
<fig id="F3" position="float"><label>Figure 3</label>
<caption><p><bold>URL presence in the archives</bold>
. Percentage of URLs found in the archives of the Internet Archive (dashed line), WebCite (dotted line) or either of the two (solid line). The Internet Archive (IA) is older and thus accounts for the lion's share of earlier published URLs, though over time WebCite contributes an increasing share.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S5-3"></graphic>
</fig>
<table-wrap id="T2" position="float"><label>Table 2</label>
<caption><p>Comparison of certain statistics based on the subject of a given URL.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Subject</th>
<th align="center">Total</th>
<th align="center"># Alive (%)</th>
<th align="center">Median Survival with 95% CI in years</th>
</tr>
</thead>
<tbody><tr><td align="center">Biochemistry &amp; Molecular Biology</td>
<td align="center">4585</td>
<td align="center">3231 (70%)</td>
<td align="center">10.8 (9.0,11.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Biotechnology &amp; Applied Microbiology</td>
<td align="center">2225</td>
<td align="center">1586 (71%)</td>
<td align="center">9.0 (8.8,9.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Computer Science</td>
<td align="center">2073</td>
<td align="center">1225 (59%)</td>
<td align="center">8.3 (7.0,9.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Biochemical Research Methods</td>
<td align="center">2023</td>
<td align="center">1463 (72%)</td>
<td align="center">8.5 (8.5,8.6)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Mathematical &amp; Computational Biology</td>
<td align="center">1661</td>
<td align="center">1200 (72%)</td>
<td align="center">7.5 (7.5,9.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Genetics &amp; Heredity</td>
<td align="center">1302</td>
<td align="center">914 (70%)</td>
<td align="center">8.8 (8.8,10.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Physics</td>
<td align="center">809</td>
<td align="center">458 (57%)</td>
<td align="center">8.0 (7.6,9.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Engineering</td>
<td align="center">703</td>
<td align="center">419 (60%)</td>
<td align="center">7.2 (7.1,10.5)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Statistics &amp; Probability</td>
<td align="center">699</td>
<td align="center">440 (63%)</td>
<td align="center">7.6 (7.0,9.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Chemistry</td>
<td align="center">591</td>
<td align="center">397 (67%)</td>
<td align="center">11.4 (9.0,11.9)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Biophysics</td>
<td align="center">432</td>
<td align="center">270 (63%)</td>
<td align="center">10.1 (10.1,10.1)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Astronomy &amp; Astrophysics</td>
<td align="center">416</td>
<td align="center">268 (64%)</td>
<td align="center">11.3 (11.1,NA)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Mathematics</td>
<td align="center">406</td>
<td align="center">254 (63%)</td>
<td align="center">10.7 (4.5,NA)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Zoology</td>
<td align="center">357</td>
<td align="center">319 (89%)</td>
<td align="center">11.2 (9.6,NA)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Cell Biology</td>
<td align="center">353</td>
<td align="center">242 (69%)</td>
<td align="center">8.0 (8.0,10.8)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Biology</td>
<td align="center">346</td>
<td align="center">242 (70%)</td>
<td align="center">9.8 (7.3,NA)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Oncology</td>
<td align="center">342</td>
<td align="center">239 (70%)</td>
<td align="center">6.9 (6.9,7.0)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Plant Sciences</td>
<td align="center">315</td>
<td align="center">235 (75%)</td>
<td align="center">9.8 (8.2,NA)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Environmental Sciences</td>
<td align="center">304</td>
<td align="center">190 (63%)</td>
<td align="center">8.0 (7.6,9.5)</td>
</tr>
<tr><td colspan="4"><hr></hr>
</td>
</tr>
<tr><td align="center">Medicine</td>
<td align="center">293</td>
<td align="center">219 (75%)</td>
<td align="center">13.3 (10.0,NA)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Subjects are assigned to journals and not to specific papers. Note that in these models, a given URL could contribute to multiple subjects because it may appear in multiple journals, each of which could also have multiple subject areas. Where possible, specific subjects were generalized (for example, "Computer Science, Interdisciplinary Applications" became "Computer Science"). Median survival was estimated using R's survfit(). "NA" indicates that an upper 95% limit could not be computed.</p>
</table-wrap-foot>
</table-wrap>
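<p>For readers unfamiliar with how a median survival time is obtained, the following is a minimal Python sketch of a Kaplan-Meier median, under the assumption that survfit() was used with its default Kaplan-Meier estimator. The function name <monospace>km_median</monospace> and the sample data are hypothetical, how event times and censoring were defined for the study is not shown in this section, and the sketch ignores R's tie-handling when the curve sits exactly at 0.5 over an interval.</p>

```python
def km_median(durations, observed):
    """Kaplan-Meier median: first event time where estimated survival is 0.5 or lower.

    durations: follow-up time of each URL (years)
    observed:  True if the URL was seen dead (event), False if still alive (censored)
    Returns None when survival never drops to 0.5 (median not estimable).
    """
    event_times = sorted({t for t, dead in zip(durations, observed) if dead})
    survival = 1.0
    for t in event_times:
        at_risk = sum(1 for u in durations if u >= t)
        deaths = sum(1 for u, dead in zip(durations, observed) if dead and u == t)
        survival *= 1.0 - deaths / at_risk  # product-limit update
        if 0.5 >= survival:
            return t
    return None


# Hypothetical follow-up data, NOT taken from the study.
example = km_median([2, 3, 3, 5, 8, 9], [True, True, False, True, False, False])
print(example)
```

<p>A return value of <monospace>None</monospace> corresponds to the situation where too few URLs have died for the curve to reach 50%, which is the same kind of limitation that produces the "NA" upper confidence limits in Table 2.</p>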
<p>How common are published, scholarly online resources? For WOS, both the percentage of published items which contained a URL as well as their absolute number increased steadily since 1996 as seen in Figure <xref ref-type="fig" rid="F1">1</xref>
. Simple linear fits showed the former increasing by a conservative 0.010% per year with an R<sup>2 </sup>
of 0.98, while the latter increased by 174 papers per year with an R<sup>2 </sup>
of 0.97.</p>
<p>A total of 189 (167 unique) DOI URLs were identified, constituting 1% of the total, while 9 PURLs (8 unique) were identified. Due to cost[<xref ref-type="bibr" rid="B14">14</xref>
], it is likely that DOIs will remain useful for tracking commercially published content though not the scholarly online items independent of those publishers.</p>
<sec><title>URL survival</title>
<p>In order to shed some light on the underlying phenomena of link rot, a survival regression model was fitted with data from the unique URLs. This model, shown in Table <xref ref-type="table" rid="T3">3</xref>
, identified 17 top-level domains, the number of times a URL has been published, a URL's directory structure depth (hereafter referred to as "depth", using the same definition as [<xref ref-type="bibr" rid="B15">15</xref>
]), the number of times the publishing article(s) has been cited, whether articles contain funding text, as well as 4 journals as having a significant impact on a URL's lifetime at the P &lt; 0.001 level. This survival regression used the logistic distribution and is interpreted similarly to logistic models. To determine the predicted outcome for a particular URL, one takes the intercept (5.2) and adds to it the coefficients for the individual predictors if those predictors are different from the base level; coefficients here are given in years. For a numeric predictor, one first multiplies the coefficient by the predictor's value before adding. The result is then interpreted as the location of the peak of a bell curve for the expected lifetime, instead of a log odds ratio as a regular logistic model would give. Among the two categorical predictors (domains and journals having more than 100 samples), the three having the largest positive impact on lifetimes were the journal <italic>Zoological Studies </italic>
(+16) and the top-level domains <italic>org </italic>
and <italic>dk </italic>
(+8 for both). Though smaller in magnitude than the positive ones, the 3 categorical predictors having the largest negative impact were the journals <italic>Computer Physics Communications </italic>
(-4) and <italic>Bioinformatics </italic>
(-2) as well as the domain <italic>kr </italic>
(-3), though the P values associated with the latter two are more marginal than some of the others (.006 and .02 respectively).</p>
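<p>Two of the predictors above, the top-level domain and the URL depth, are derived from the URL string itself. The sketch below shows one way to compute them in Python. The definition of depth used in the study follows the cited reference and is not reproduced here, so counting non-empty path segments is an assumption made for illustration, and both function names are hypothetical.</p>

```python
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'org' or 'dk'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def url_depth(url):
    """Count non-empty path segments (ASSUMED definition of depth)."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


print(top_level_domain("http://www.example.org/tools/index.html"))
print(url_depth("http://www.example.org/tools/index.html"))
```

<p>Under this assumed definition a bare host such as <monospace>http://www.example.org/</monospace> has depth 0, so the negative depth coefficient would penalize only URLs that point below the site root.</p>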
<table-wrap id="T3" position="float"><label>Table 3</label>
<caption><p>Results of fitting a parametric survival regression using the logistic distribution to the unique URLs.</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="center">Variable</th>
<th align="center">Value</th>
<th align="center">p</th>
<th align="center">5%</th>
<th align="center">95%</th>
</tr>
</thead>
<tbody><tr><td align="center">(Intercept)</td>
<td align="center">5.22</td>
<td align="center">3.3E-30</td>
<td align="center">4.46</td>
<td align="center">5.97</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Log2(URL published)</td>
<td align="center">3.57</td>
<td align="center">1.4E-17</td>
<td align="center">2.88</td>
<td align="center">4.25</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">depth</td>
<td align="center">-1.46</td>
<td align="center">7.0E-32</td>
<td align="center">-1.66</td>
<td align="center">-1.25</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Log2(TimesCited + 1)</td>
<td align="center">0.25</td>
<td align="center">2.8E-04</td>
<td align="center">0.13</td>
<td align="center">0.36</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Funding text present</td>
<td align="center">3.43</td>
<td align="center">2.8E-11</td>
<td align="center">2.59</td>
<td align="center">4.28</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center" colspan="5"><bold>Domain</bold>
</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">au</td>
<td align="center">4.53</td>
<td align="center">1.5E-04</td>
<td align="center">2.56</td>
<td align="center">6.49</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">be</td>
<td align="center">3.31</td>
<td align="center">1.9E-02</td>
<td align="center">0.99</td>
<td align="center">5.64</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">ca</td>
<td align="center">4.88</td>
<td align="center">1.7E-06</td>
<td align="center">3.20</td>
<td align="center">6.56</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">ch</td>
<td align="center">6.45</td>
<td align="center">7.2E-08</td>
<td align="center">4.48</td>
<td align="center">8.42</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">cn</td>
<td align="center">1.50</td>
<td align="center">1.3E-01</td>
<td align="center">-0.13</td>
<td align="center">3.13</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">com</td>
<td align="center">6.02</td>
<td align="center">2.2E-18</td>
<td align="center">4.89</td>
<td align="center">7.16</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">de</td>
<td align="center">5.74</td>
<td align="center">6.1E-16</td>
<td align="center">4.57</td>
<td align="center">6.91</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">dk</td>
<td align="center">7.66</td>
<td align="center">5.7E-07</td>
<td align="center">5.14</td>
<td align="center">10.18</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">edu</td>
<td align="center">3.77</td>
<td align="center">1.6E-13</td>
<td align="center">2.93</td>
<td align="center">4.61</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">es</td>
<td align="center">3.05</td>
<td align="center">5.4E-03</td>
<td align="center">1.25</td>
<td align="center">4.85</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">fr</td>
<td align="center">3.65</td>
<td align="center">6.6E-07</td>
<td align="center">2.44</td>
<td align="center">4.85</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">gov</td>
<td align="center">5.51</td>
<td align="center">1.2E-15</td>
<td align="center">4.38</td>
<td align="center">6.64</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">il</td>
<td align="center">5.92</td>
<td align="center">3.6E-04</td>
<td align="center">3.19</td>
<td align="center">8.65</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">in</td>
<td align="center">4.78</td>
<td align="center">2.2E-04</td>
<td align="center">2.65</td>
<td align="center">6.91</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">it</td>
<td align="center">5.51</td>
<td align="center">1.4E-08</td>
<td align="center">3.91</td>
<td align="center">7.11</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">jp</td>
<td align="center">5.07</td>
<td align="center">8.0E-09</td>
<td align="center">3.62</td>
<td align="center">6.51</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">kr</td>
<td align="center">-3.35</td>
<td align="center">2.0E-02</td>
<td align="center">-5.73</td>
<td align="center">-0.97</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">net</td>
<td align="center">7.01</td>
<td align="center">4.2E-11</td>
<td align="center">5.26</td>
<td align="center">8.76</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">nl</td>
<td align="center">6.78</td>
<td align="center">1.1E-06</td>
<td align="center">4.49</td>
<td align="center">9.07</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">org</td>
<td align="center">8.10</td>
<td align="center">2.4E-36</td>
<td align="center">7.04</td>
<td align="center">9.16</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">ru</td>
<td align="center">3.90</td>
<td align="center">2.3E-03</td>
<td align="center">1.80</td>
<td align="center">6.01</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">se</td>
<td align="center">1.71</td>
<td align="center">2.4E-01</td>
<td align="center">-0.69</td>
<td align="center">4.12</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">tw</td>
<td align="center">1.64</td>
<td align="center">1.7E-01</td>
<td align="center">-0.33</td>
<td align="center">3.61</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">uk</td>
<td align="center">4.49</td>
<td align="center">4.2E-12</td>
<td align="center">3.42</td>
<td align="center">5.56</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center" colspan="5"><bold>Source</bold>
</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Bioinformatics</td>
<td align="center">-2.04</td>
<td align="center">5.7E-03</td>
<td align="center">-3.25</td>
<td align="center">-0.83</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">BMC Bioinformatics</td>
<td align="center">2.69</td>
<td align="center">3.9E-05</td>
<td align="center">1.62</td>
<td align="center">3.77</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">BMC Genomics</td>
<td align="center">0.88</td>
<td align="center">4.7E-01</td>
<td align="center">-1.13</td>
<td align="center">2.89</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Comp. Physics Comm.</td>
<td align="center">-4.00</td>
<td align="center">3.0E-05</td>
<td align="center">-5.57</td>
<td align="center">-2.42</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Genome Research</td>
<td align="center">0.56</td>
<td align="center">7.1E-01</td>
<td align="center">-1.92</td>
<td align="center">3.04</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Nucleic Acids Research</td>
<td align="center">1.28</td>
<td align="center">8.6E-04</td>
<td align="center">0.65</td>
<td align="center">1.91</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">PLoS ONE</td>
<td align="center">-0.39</td>
<td align="center">8.0E-01</td>
<td align="center">-2.95</td>
<td align="center">2.18</td>
</tr>
<tr><td colspan="5"><hr></hr>
</td>
</tr>
<tr><td align="center">Zoological Studies</td>
<td align="center">16.42</td>
<td align="center">2.2E-15</td>
<td align="center">13.01</td>
<td align="center">19.83</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p>Positive numbers indicate longer median lifetimes. Much like a logistic model, coefficients can be added to the intercept value (after multiplying in the case of numeric predictors) to obtain a median lifetime. For example, the median expected lifetime for a URL published once, with depth 0, whose publishing article had 1 citation and no funding text, with domain au and published in a journal not listed (i.e., the default), would be: (Intercept) 5.22 + Log2(1)*3.57 + 0*(-1.46) + Log2(1+1)*0.25 + 0*3.43 + 4.53 = 10 years.</p>
</table-wrap-foot>
</table-wrap>
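<p>The worked example in the Table 3 footnote can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes a subset of the Table 3 coefficients; the function and dictionary names are hypothetical, and treating any unlisted domain or journal as the base level (coefficient 0) is an assumption drawn from the description above.</p>

```python
from math import log2

# Coefficients copied from Table 3 (years).
INTERCEPT = 5.22
PER_LOG2_TIMES_PUBLISHED = 3.57
PER_DEPTH_LEVEL = -1.46
PER_LOG2_TIMES_CITED_PLUS_1 = 0.25
FUNDING_TEXT_PRESENT = 3.43
DOMAIN = {"au": 4.53, "org": 8.10, "kr": -3.35}                  # subset of Table 3
SOURCE = {"Bioinformatics": -2.04, "Zoological Studies": 16.42}  # subset of Table 3


def predicted_median_lifetime(times_published, depth, times_cited,
                              funding, domain=None, source=None):
    """Sum the intercept and the applicable Table 3 coefficients."""
    value = INTERCEPT
    value += log2(times_published) * PER_LOG2_TIMES_PUBLISHED
    value += depth * PER_DEPTH_LEVEL
    value += log2(times_cited + 1) * PER_LOG2_TIMES_CITED_PLUS_1
    value += FUNDING_TEXT_PRESENT if funding else 0.0
    value += DOMAIN.get(domain, 0.0)  # unlisted domain: assumed base level
    value += SOURCE.get(source, 0.0)  # unlisted journal: assumed base level
    return value


# The footnote's example: published once, depth 0, 1 citation, no funding, domain au.
print(round(predicted_median_lifetime(1, 0, 1, False, domain="au"), 2))
```

<p>With these inputs only the intercept, the citation term (0.25) and the au coefficient (4.53) contribute, which gives the 10 years stated in the footnote.</p>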
</sec>
<sec><title>Predictors of availability</title>
<p>While examining URL survival and archival, it is interesting to ask not only which factors significantly correlate with a URL lasting but also which account for most of the differences. To that end, we fit logistic models for each of the measured outcomes (live web, Internet Archive and WebCite availability) to help tease out that information. To enhance comparability, a similar list of predictors (differing only in whether the first or last year a URL was published was used) without interaction terms was employed for all 3 models, and unique deviance was calculated by dropping each term from the model and measuring the change in residual deviance. Results were then expressed as a percentage of the total uniquely explained deviance and are shown graphically in Figure <xref ref-type="fig" rid="F4">4</xref>
.</p>
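<p>The bookkeeping behind the "unique deviance" percentages can be sketched as follows. The sketch assumes the logistic models have already been fitted elsewhere and that only their residual deviances are at hand; the function name and all numbers are hypothetical and are not the study's values.</p>

```python
def unique_deviance_share(full_deviance, reduced_deviances):
    """Percentage of the total uniquely explained deviance per predictor.

    full_deviance:     residual deviance of the model with all predictors
    reduced_deviances: predictor name -> residual deviance of the model
                       refitted without that predictor
    """
    unique = {name: dev - full_deviance for name, dev in reduced_deviances.items()}
    total = sum(unique.values())
    return {name: 100.0 * value / total for name, value in unique.items()}


# Hypothetical residual deviances, for illustration only.
shares = unique_deviance_share(
    1000.0,
    {"year": 1042.0, "domain": 1026.0, "depth": 1032.0},
)
print(shares)
```

<p>Because each share is normalized by the sum of the uniquely explained deviances, the percentages for one outcome always add up to 100%, which is what makes the three outcomes comparable in Figure 4.</p>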
<fig id="F4" position="float"><label>Figure 4</label>
<caption><p><bold>How important is each predictor in predicting whether a URL is available?</bold>
 This graph compares what portion of the overall deviance is explained uniquely by each predictor for each of the measured outcomes. A similar list of predictors (differing only in whether the first or last year a URL was published) without interaction terms was employed to construct 3 logistic regression models. The dependent variable for each of the outcomes under study (Live Web, Internet Archive and WebCite) was availability at the time of measurement. Unique deviance was calculated by dropping each term and measuring the change in explained deviance in the logistic model. Results were then expressed as a percentage of the total uniquely explained deviance for each of the 3 methods.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S5-4"></graphic>
</fig>
<p>For live web availability, the most deviance was explained by the last year a URL was published (42%) followed by the domain (26%). That these two predictors are very important agrees with much of the published literature thus far. For the Internet Archive, by far the most important predictor was the URL depth at 45%. Based on this, it stands to reason that the Internet Archive either prefers more popular URLs which happen to be at lower depths or employs an algorithm that prioritizes breadth over depth. Similar to the IA, WC had a single predictor that accounted for much of the explained deviance, with the publishing journal representing 49% of the explained deviance. This may reflect WC's efforts to work with publishers as the model shows one of the announced early adopters, BioMed Central [<xref ref-type="bibr" rid="B7">7</xref>
], as having the two measured journals (BMC Bioinformatics and BMC Genomics) with the highest retention rates. Therefore, WC is biased towards a publication's source (journals).</p>
</sec>
<sec><title>Archive site performance</title>
<p>Another way to measure the effectiveness of the current solutions to link decay is to look at the number of "saved" URLs, or those missing ones that are available through archival engines. Out of the 31% of URLs (33% of the unique) which were not accessible on the live web, 49% of them (47% of the unique) were available in one of the two engines, with IA having 47% (46% unique) and WC having 7% (6% unique). WC's comparatively lower performance can likely be attributed to a combination of its requirement for human interaction and its still-growing adoption.</p>
<p>In order to address the discrepancy, all sites that were still active but not archived were submitted to the engine(s) from which they were missing. Using the information gleaned from probing the sites as well as the archives, URLs missing from one or both of the archives, yet still alive, were submitted programmatically. This included submitting 2,662 to the Wayback Machine as well as 7,477 to WebCite, of which 2,080 and 6,348 were successful, respectively.</p>
</sec>
</sec>
<sec sec-type="discussion"><title>Discussion</title>
<sec><title>Submission of missing URLs to archives</title>
<p>Archiving missing URLs had its own nuances for each of the archival engines. For the Internet Archive, the lack of a practical documented way of submitting URLs (see <ext-link ext-link-type="uri" xlink:href="http://faq.web.archive.org/my-sites-not-archived-how-can-i-add-it/">http://faq.web.archive.org/my-sites-not-archived-how-can-i-add-it/</ext-link>
) necessitated trusting a message shown by the Wayback Machine when one finds a URL that isn't archived and clicks the "Latest" button. In this instance, the user is sent to the URL "<ext-link ext-link-type="uri" xlink:href="http://liveweb.archive.org/">http://liveweb.archive.org/</ext-link>
" which has a banner proclaiming that the page "will become part of the permanent archive in the next few months". Interestingly, as witnessed by requests for a web page hosted on a server for which the authors could monitor the logs, only those items requested by the client were downloaded. This meant that if only a page's text were fetched, supporting items such as images and CSS files would not be archived. To archive the supporting items and avoid duplicating work, wget's "--page-requisites" option was used instead of a custom parser.</p>
<p>WebCite has an easy-to-use API for submitting URLs, though limitations encountered during the submission of our dataset presented some issues. The biggest issue was WebCite's abuse detection process, which would flag the robot after it had made a certain number of requests. To account for this and to be considerate users, we added logic to ensure a minimum delay between archival requests submitted to both the IA and WC. Exponential delay logic was implemented for WC when encountering general timeouts, other failures (such as MySQL error messages) or the abuse logic. Eventually, we learned that certain URLs would cause WC's crawler to time out indefinitely, requiring the implementation of a maximum retry count (and a failure status) if the error wasn't caused by the abuse logic.</p>
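<p>The retry behaviour described above (a minimum delay, exponential back-off, and a retry cap that does not apply to abuse-detection responses) might look roughly like the following. This is a sketch under stated assumptions rather than the authors' code: <monospace>submit</monospace> is a hypothetical callable standing in for the WebCite request, the status strings and default values are invented for illustration, and the cap on the delay is an added assumption.</p>

```python
import time


def submit_with_backoff(submit, url, min_delay=5.0, max_delay=3600.0,
                        max_retries=5, sleep=time.sleep):
    """Call submit(url) until accepted, doubling the wait after each failure.

    submit is a HYPOTHETICAL callable returning "ok", "abuse" or "error".
    Abuse responses are retried without counting against max_retries;
    any other failure gives up after max_retries attempts.
    """
    delay = min_delay
    failures = 0
    while True:
        sleep(delay)  # always honour the minimum delay between requests
        status = submit(url)
        if status == "ok":
            return "submitted"
        if status != "abuse":
            failures += 1
            if failures >= max_retries:
                return "failed"
        delay = min(delay * 2, max_delay)  # exponential delay, capped (assumption)
```

<p>Passing the <monospace>sleep</monospace> function as a parameter keeps the sketch testable without real waiting; in production use the default <monospace>time.sleep</monospace> applies.</p>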
<p>To estimate what impact we had on the archives' coverage of the study URLs, we compared a URL survey done directly prior to our submission process to one done afterwards, a period of about 3.5 months. It was assumed that the contribution due to unrelated processes would not be very large given that there was only a modest increase in coverage, 5% for IA and 1% for WC, over the previous period of just under a year and a half.</p>
<p>Each of the two archival engines had interesting behaviors which required gauging successful submission of a URL by whether it was archived as of a subsequent survey rather than by the statuses returned by the engines. For the Internet Archive, it was discovered that an error didn't always indicate failure, as there were 872 URLs for which wget returned an error but which were successfully archived. Conversely, WebCite returned an asynchronous status, such that even in the case of a successful return the URL might fail archival; this was the case for 955 out of a total of 7,285.</p>
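<p>The Internet Archive submission step described earlier in this section, fetching a page and its supporting files through the Wayback Machine's live-web address with wget, could be scripted along these lines. This is a hedged sketch: the text only gives the base address <monospace>http://liveweb.archive.org/</monospace>, so appending the target URL directly to it is an assumption, as are the helper names, the timeout value and the choice of the additional wget options.</p>

```python
import subprocess

LIVEWEB_PREFIX = "http://liveweb.archive.org/"  # base address quoted in the text


def build_wget_command(url, timeout_seconds=60):
    """Build a wget call that requests the page and its requisites via the live-web address."""
    archive_url = LIVEWEB_PREFIX + url  # ASSUMED URL layout
    return [
        "wget",
        "--page-requisites",               # also fetch images, CSS and similar files
        "--delete-after",                  # local copies are not needed
        "--tries=1",
        "--timeout=" + str(timeout_seconds),
        "--quiet",
        archive_url,
    ]


def run_wget(url):
    """Run wget and return its exit code; not invoked by this sketch itself."""
    return subprocess.run(build_wget_command(url), check=False).returncode


print(" ".join(build_wget_command("http://example.org/tool/")))
```

<p>As noted above, a non-zero exit code from wget did not reliably indicate a failed archival, so a script like this should record the code for diagnostics only and rely on a later survey of the archive to decide success.</p>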
<p>Submitting the 2,662 URLs to IA took a little less than a day, whereas submitting 7,285 to WC took over 2 months. This likely reflects IA's large server capacity, funding and platform maturity due to its age.</p>
</sec>
<sec><title>Generating the list of unique URLs</title>
<p>Converting some of the potential predictors from the list of published URLs to the list of unique URLs presented some challenges. In particular, while converting those based on the URL itself (domain, depth, whether alive or in an archive) was straightforward, those which depended upon a publishing article (number of times the URL was published, the number of times an article was cited, publishing journal, whether there was funding text) were estimated by collating the data from each publication. Only a small fraction (8%) of the unique URLs appeared more than once, and among the measured variables that pertained to the publication there was not a large amount of variety. Amongst repeatedly published URLs, 43% appeared in only one journal and the presence of funding text was the same 76% of the time. For calculating the number of times a URL was published, multiple appearances of a URL within a given title/abstract were counted as one. Thus, while efforts were made to provide a representative collated value where appropriate, it is expected that different methods would not have produced significantly different results.</p>
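<p>The collation step can be pictured as a group-by over publication records. The sketch below is illustrative only: the record keys are hypothetical, and the rules used to merge citation counts (maximum) and funding text (present in any publication) are assumptions, since the exact collation rules are not given in this section.</p>

```python
from collections import defaultdict


def collate_by_url(records):
    """Group publication records by URL and derive per-URL predictors.

    Each record is a dict with HYPOTHETICAL keys:
    url, article_id, year, times_cited, funding.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["url"]].append(record)

    collated = {}
    for url, recs in grouped.items():
        # Repeats of a URL within one title/abstract count as one publication.
        articles = {r["article_id"] for r in recs}
        collated[url] = {
            "times_published": len(articles),
            "first_year": min(r["year"] for r in recs),
            "last_year": max(r["year"] for r in recs),
            "times_cited": max(r["times_cited"] for r in recs),  # assumed rule
            "funding": any(r["funding"] for r in recs),          # assumed rule
        }
    return collated


# Hypothetical records, for illustration only.
sample = [
    {"url": "http://example.org/a", "article_id": 1, "year": 2001, "times_cited": 3, "funding": False},
    {"url": "http://example.org/a", "article_id": 1, "year": 2001, "times_cited": 3, "funding": False},
    {"url": "http://example.org/a", "article_id": 2, "year": 2005, "times_cited": 10, "funding": True},
    {"url": "http://example.org/b", "article_id": 3, "year": 2008, "times_cited": 0, "funding": False},
]
print(collate_by_url(sample)["http://example.org/a"])
```

<p>Keeping both the first and the last publication year per URL matches the models described earlier, which differ in which of the two years they use.</p>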
</sec>
<sec><title>Additional sources of error</title>
<p>Even though WOS's index appears to have better quality Optical Character Recognition (OCR) than PubMed, it still has OCR artifacts. To compensate for this, the URL extraction script tried to use some heuristics to detect the most common sources of error and correct them. Some of the biggest sources of error were: randomly inserted spaces in URLs, "similar to" being substituted for the tilde character, periods being replaced with commas and extra punctuation being appended to the URL (sometimes due to the logic added to address the first issue).</p>
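<p>The kinds of corrections listed above can be sketched as a small clean-up function. This is not the authors' extraction script: the function name, the order of the steps and the exact rules are assumptions, and the sketch presumes that the candidate URL string has already been isolated from the surrounding abstract text, which is the harder part of the problem.</p>

```python
import re


def clean_ocr_url(raw):
    """Apply simple corrections for the OCR artifacts described in the text."""
    url = raw.strip()
    # "similar to" was substituted for the tilde character.
    url = re.sub(r"\s*similar to\s*", "~", url)
    # Remove spaces that were inserted inside the URL.
    url = re.sub(r"\s+", "", url)
    # Commas standing in for periods: corrected in the host part only (assumption).
    scheme, separator, rest = url.partition("://")
    host, slash, path = rest.partition("/")
    url = scheme + separator + host.replace(",", ".") + slash + path
    # Strip extra punctuation appended to the end of the URL.
    return url.rstrip(".,;:)]")


print(clean_ocr_url("http://www.example.edu/similar to user/tool.html)."))
print(clean_ocr_url("http://www,example, org/data,"))
```

<p>Each rule can also damage a correct URL (for instance, one that legitimately ends in a closing parenthesis), which is consistent with the observation above that the compensation logic was itself a source of error.</p>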
<p>Likely the largest contributors to false negatives are errors in OCR and the attempts to compensate for them. In assessing the effectiveness of our submissions to IA, it is possible that the estimate could be understated due to URLs that had been submitted but not yet made available within the Wayback Machine.</p>
<p>Dynamic websites with interactive content, if only present via an archiving engine, would be a source of false positives, as the person accessing the resource would presumably want to use it as opposed to viewing the design work of its landing page. If a published web site goes away and another is installed in its place (especially likely if a .com or .net domain is allowed to expire), then the program will not be able to tell the difference since it will see a valid (though unrelated) web site. In addition, though page contents can change and lose relevance from their original use[<xref ref-type="bibr" rid="B16">16</xref>
], dates of archival were not compared to the publication date.</p>
<p>Another source of false positive error would be uncaught OCR artifacts that insert spaces within URLs, if they truncated the path but left the correct host intact. The result would be a higher probability that the URL would appear as a higher-level index page, and such pages are generally more likely to function than pages at lower levels [<xref ref-type="bibr" rid="B11">11</xref>
,<xref ref-type="bibr" rid="B12">12</xref>
].</p>
</sec>
<sec><title>Bibliographic database</title>
<p>Web of Science was chosen because, compared to PubMed, it was more cross-sectional and had better OCR quality based on a small sampling. Many of the other evaluation criteria were similar between PubMed and WOS, as both contain scholarly work and have an interface to download bibliographic data. Interestingly, due to the continued presence of OCR issues in newer articles, it appears that bibliographic information for some journals is not yet passed electronically.</p>
</sec>
</sec>
<sec sec-type="conclusions"><title>Conclusions</title>
<p>Based on the data gathered in this and other studies, it is apparent that there is still a problem with irretrievable scholarly research on the Internet. We found that roughly 50% of URLs published 11 years prior to the survey (in 2000) are still left standing. Interestingly, the rate of decay for more recently published URLs (within the past 11 years) appears to be higher than that for the older ones, lending credence to what Koehler suggested about eventual decay rate stabilization[<xref ref-type="bibr" rid="B17">17</xref>
]. Survival rates for living URLs published between 1996 and 1999, inclusive, only vary by 2.4% (1.5% for unique) and have poor linear fits (R<sup>2 </sup>
of .51 and .18 for unique), whereas years [2000, 2010] have linear slope 0.031 and R<sup>2 </sup>
.90 (.036 and R<sup>2 </sup>
.95 for unique URLs using the first published year), indicating that the availability between years for older URLs is much more stable, whereas the availability for more recent online resources follows a linear trend with a predictable loss rate. Overall, 84% of URLs (82% of the unique) were available in some manner: either via the web, IA or WC.</p>
<p>Several remedies are available to address different aspects of the link decay problem. For data-based sites that can be archived properly with an engine such as the Internet Archive or WebCite, one remedy is to submit the missing sites which are still alive to the archiving engines. Based on the results of our prototype (illustrated in Figure <xref ref-type="fig" rid="F5">5</xref>
), this method was highly successful, increasing IA's coverage of the study's URLs by 22% and WebCite's by 255%. Journals could require authors to submit URLs to both the Internet Archive and WebCite, or alternatively programs similar to those employed in this study could be used to do it automatically. Another way to increase archival would be for the owners of published sites to ease restrictions for archiving engines, since 507 (352 unique) of the published URLs had archiving disabled via robots.txt according to the Internet Archive. Amongst these, 16% (22% of the unique) have already ceased being valid. While some sites may have good reason for blocking automated archivers (such as dynamic content or licensing issues), there may be others that could remove their restrictions entirely or provide an exception for preservation engines.</p>
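<p>Site owners who want to check whether their robots.txt blocks a preservation crawler can do so with Python's standard library. The following is a minimal sketch: the user-agent name <monospace>ia_archiver</monospace> is an assumption about the crawler name a site would need to allow, the robots.txt content is a made-up example, and the function name is hypothetical.</p>

```python
from urllib.robotparser import RobotFileParser


def archiving_allowed(robots_txt, url, user_agent="ia_archiver"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt that blocks one archiver but allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

print(archiving_allowed(EXAMPLE_ROBOTS, "http://example.org/tool/"))
print(archiving_allowed(EXAMPLE_ROBOTS, "http://example.org/tool/", user_agent="othercrawler"))
```

<p>Removing the first rule group, or replacing its <monospace>Disallow: /</monospace> line with an empty <monospace>Disallow:</monospace>, would be one way to provide the kind of exception for preservation engines suggested above.</p>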
<fig id="F5" position="float"><label>Figure 5</label>
<caption><p><bold>Coverage of the scholarly URL list for each archival engine at different times</bold>
. All URLs marked as alive in 2011 but missing from an archive were submitted between the 2012 and 2013 surveys. The effect of submitting the URLs is most evident in the WebCite case though the Internet Archive also showed substantial improvement. Implementing an automated process to do this could vastly improve the retention of scholarly static web pages.</p>
</caption>
<graphic xlink:href="1471-2105-14-S14-S5-5"></graphic>
</fig>
<p>To address the control issue for redirection solutions (DOI, PURL) mentioned in the introduction, those who administer cited tools could begin to maintain and publish a permanent URL on the web site itself. A more radical step would be for either these existing tools or some new tool to take a Wikipedia-like approach and allow end-users to update and search a database of permanent URLs. Considering the studies that have shown at least 30% of dead URLs to be locatable using web search engines [<xref ref-type="bibr" rid="B3">3</xref>
,<xref ref-type="bibr" rid="B18">18</xref>
], such a peer-maintained system could be effective and efficient, though spam could be an issue if not properly addressed.</p>
<p>For dynamic websites, the current solutions are more technically involved, potentially expensive and less feasible. These include mirroring (hosting a website on another server, possibly at another institution) and providing access to the source code, both of which require time and effort. Once the source is acquired, it can take considerable expertise to make use of it, as there may be complex libraries or framework configurations, local assumptions hard-coded into the software, or a dependence on a different platform (GPU, Unix, Windows, etc.). Efforts toward reproducible research, in which the underlying logic and data behind the results of a publication are made available to the greater community, share many of the same requirements as preserving dynamic websites [<xref ref-type="bibr" rid="B19">19</xref>
,<xref ref-type="bibr" rid="B20">20</xref>
]. Innovation in this area could thus have benefits beyond archival alone.</p>
</sec>
<sec sec-type="methods"><title>Methods</title>
<sec><title>Data preparation and analysis</title>
<p>The then-current year (2011) was excluded to eliminate bias from certain journals being indexed sooner than others. For analysis and statistical modeling, the R program [<xref ref-type="bibr" rid="B21">21</xref>
] and its "survival" library [<xref ref-type="bibr" rid="B22">22</xref>
] were used (scripts included in Additional file <xref ref-type="supplementary-material" rid="S1">1</xref>
).</p>
<p>Wherever possible, statistics are presented in two forms: one representing the raw list of URLs extracted from abstracts and the other representing a deduplicated set of those URLs. The former is most appropriate when considering what a researcher would encounter when trying to use a URL published in an article of interest, and it also gives weight to URLs published multiple times. The latter is more appropriate when contemplating scholarly URLs as a whole or when using statistical models that assume independence between samples.</p>
<p>URLs outside the scope of this study, such as journal promotions and invalid URLs, were excluded using computational methods as much as possible in order to minimize subjective bias. The first method, which removed 943 URLs (26 unique), looked for identical URLs that comprised a large percentage of a journal's published collection within a given year. Each candidate was then examined manually before deciding whether to eliminate it. The second method, which identified 18 invalid URLs (all unique), consisted of checking for WebCite's "UnexpectedXML" error. These URLs were corrupted to the point that they interfered with XML interpretation of the request, due to an error in either our parsing or the OCR.</p>
<p>DOI sites were identified by virtue of containing "<ext-link ext-link-type="uri" xlink:href="http://dx.doi.org">http://dx.doi.org</ext-link>
". PURL sites were identified by virtue of containing "http://purl." in the URL. Interestingly, 3 PURL servers were identified through this mechanism: <ext-link ext-link-type="uri" xlink:href="http://purl.oclc.org">http://purl.oclc.org</ext-link>
, <ext-link ext-link-type="uri" xlink:href="http://purl.org">http://purl.org</ext-link>
 and <ext-link ext-link-type="uri" xlink:href="http://purl.access.gpo.gov">http://purl.access.gpo.gov</ext-link>
.</p>
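<p>The substring tests described above can be sketched in Python as follows. This is a minimal illustration rather than the study's own script; the function name classify_redirect is hypothetical, and lowercasing the URL before matching is an added assumption.</p>

```python
def classify_redirect(url):
    """Label a URL as 'doi', 'purl' or 'other' using simple substring tests."""
    lowered = url.lower()  # assumption: match case-insensitively
    if "http://dx.doi.org" in lowered:
        return "doi"
    if "http://purl." in lowered:
        return "purl"
    return "other"
```

<p>Because the PURL test only looks for the "http://purl." prefix, it matches any host whose name begins with "purl.", which is how several distinct PURL servers can be picked up by one rule.</p>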
<p>To make the results more comparable to prior work and the analysis easier to interpret, a URL was considered available if it responded successfully to at least 90% of the requests and unavailable otherwise. This method is similar to that used by Wren [<xref ref-type="bibr" rid="B4">4</xref>
], and differs from Ducut's [<xref ref-type="bibr" rid="B1">1</xref>
] by not using a "variable availability" category, defined as being available more than 0% but less than 90% of the time. Our results show that 466 unique URLs (3.2%) would have fallen in this middle category, a proportion quite similar to the corresponding figures for Wren's and Ducut's studies (3.4% and 3.2%, respectively). Because these URLs are such a small percentage of the total, their treatment is unlikely to affect the analysis much, however they are classified. Having binary data also eases interpretation of the statistical models. In addition, due to the low URL counts for 1994 (3) and 1995 (22), these years were excluded from analysis.</p>
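<p>The 90% rule can be written as a small Python sketch. The function names are hypothetical, and integer arithmetic is used here as an assumption to avoid floating-point rounding at the threshold.</p>

```python
def classify_availability(successes, attempts, threshold_percent=90):
    """Return 'available' when at least threshold_percent of requests succeeded."""
    if not attempts > 0:
        raise ValueError("attempts must be positive")
    meets = successes * 100 >= attempts * threshold_percent
    return "available" if meets else "unavailable"


def in_variable_band(successes, attempts, threshold_percent=90):
    """True for the middle band: some successes, but fewer than the threshold."""
    meets = successes * 100 >= attempts * threshold_percent
    return successes > 0 and not meets
```

<p>For example, a URL that answered 81 of 90 requests is classified as available, while one that answered 80 of 90 is unavailable and would have fallen in the "variable availability" band under a three-category scheme.</p>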
<sec><title>Survival model</title>
<p>Survival analysis was chosen to analyze living URLs because of its natural fit: like people, URLs have lifetimes, and we are interested in what makes those lifetimes longer or shorter and by how much. Lifetimes were calculated by assuming that URLs were alive at the time of each publication, which is a potential source of error [<xref ref-type="bibr" rid="B2">2</xref>
]. Data were coded as either right- or left-censored: right-censored because living URLs will presumably die at an unknown time in the future, and left-censored because it was unknown when a non-responding URL had died. Ages were coded in months rather than years in order to increase accuracy and precision.</p>
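<p>The censoring scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, and the age is taken as the whole-month difference between a publication date and the survey date.</p>

```python
from datetime import date


def months_between(published, surveyed):
    """Whole calendar months from the publication date to the survey date."""
    return (surveyed.year - published.year) * 12 + (surveyed.month - published.month)


def code_observation(published, surveyed, alive):
    """Code one URL as a censored survival observation.

    A living URL is right-censored (it will die at some unknown later time);
    a dead URL is left-censored (it died at some unknown earlier time).
    """
    return {
        "age_months": months_between(published, surveyed),
        "censoring": "right" if alive else "left",
    }
```

<p>A URL published in January 2000 and found alive in April 2011 would thus be recorded as right-censored at 135 months.</p>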
<p>Parametric survival regression models were constructed using R's <italic>survreg()</italic>
. In selecting the distribution to use, all of those available were tried, with the logistic distribution showing the best overall fit based on the Akaike Information Criterion (AIC) score. Better fits for two of the numeric predictors (number of citations to a publishing paper and number of times a URL was published) were obtained by taking the base 2 logarithm. Collinearity was checked by calculating the variance inflation factor against a logistic regression fit to the web outcome variable. Overall lifetime estimates were made using the <italic>survfit()</italic>
function from R's survival library.</p>
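<p>The AIC comparison used to choose a distribution can be illustrated with a short Python sketch. The study's models were fit in R; the helper names below are hypothetical, and the caller is assumed to supply each model's maximized log-likelihood and parameter count.</p>

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(fits):
    """Pick the name with the lowest AIC.

    fits maps a distribution name to a (log_likelihood, n_params) pair.
    """
    return min(fits, key=lambda name: aic(*fits[name]))
```

<p>Because the criterion penalizes each additional parameter, a more complex distribution is preferred only when it improves the log-likelihood by more than one unit per extra parameter.</p>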
</sec>
<sec><title>Extracting and testing URLs</title>
<p>To prepare a list of URLs (and their associated data), a collection of bibliographic data was compiled by searching WOS for "http" in the title or abstract, downloading the results (500 at a time), then finally collating them into a single file. A custom program (extract_urls.py in Additional file <xref ref-type="supplementary-material" rid="S1">1</xref>
) was then used to extract the URLs and associated metadata from these, after which 5 positive and 2 negative controls were added. A particular URL was only included once per paper.</p>
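<p>A simplified view of the extraction step is sketched below in Python. It is not the extract_urls.py script from Additional file 1; the regular expression, the trailing-punctuation cleanup and the function name extract_urls are assumptions chosen for illustration.</p>

```python
import re

# Assumed pattern: a URL runs until whitespace, a quote, a parenthesis,
# a closing angle bracket or a closing square bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'()>\]]+", re.IGNORECASE)


def extract_urls(abstract):
    """Return the distinct URLs in one abstract, in order of first appearance."""
    found = []
    for match in URL_PATTERN.findall(abstract):
        url = match.rstrip(".,;:")  # drop sentence punctuation glued to the URL
        if url and url not in found:
            found.append(url)  # a given URL is kept only once per paper
    return found
```

<p>Deduplicating within each abstract mirrors the rule that a particular URL was only included once per paper.</p>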
<p>With the extracted URLs in hand, another custom program (check_urls_web.py in Additional file <xref ref-type="supplementary-material" rid="S1">1</xref>
) was used to test the availability of the URLs 3 times a day over the course of 30 days, starting April 16, 2011. These times were generated randomly by scheduler.py (included in Additional file <xref ref-type="supplementary-material" rid="S1">1</xref>
), with the algorithm guaranteeing that no two consecutive runs were less than 2 hours apart. A given URL was only visited once per run even if it was published multiple times, reducing load on the server and speeding up the total runtime (which averaged about 25 minutes due to the use of parallelism). Failure was defined as anything that caused an exception in Python's "urllib2" package (which includes HTTP error statuses such as 404), with the exception reason recorded for later analysis.</p>
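<p>One way to generate such a schedule is sketched below in Python. The algorithm used by scheduler.py is not described here, so this sketch substitutes simple rejection sampling: it redraws a day's run times until every consecutive pair, including the pair spanning midnight, is at least the minimum gap apart. The function name and parameters are hypothetical.</p>

```python
import random


def schedule_runs(days=30, runs_per_day=3, min_gap_hours=2.0, seed=None):
    """Return run times, in hours since the start of day 0, in ascending order."""
    rng = random.Random(seed)
    times = []
    for day in range(days):
        while True:
            candidate = sorted(day * 24 + rng.uniform(0, 24) for _ in range(runs_per_day))
            # Include the last run of the previous day so the gap across midnight is checked.
            sequence = times[-1:] + candidate
            if all(b - a >= min_gap_hours for a, b in zip(sequence, sequence[1:])):
                times.extend(candidate)
                break
    return times
```

<p>With the default arguments this yields 90 run times, three in each of the 30 days.</p>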
<p>While investigating some of the failed fetches, we noticed that some URLs would consistently work in a web browser but not with the Python program or with command-line downloaders such as wget. Further investigation showed that the web servers in question were denying access to unrecognized User Agent strings. In response, the Python program adopted the User Agent of a regular browser, which reduced the number of failed URLs.</p>
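<p>The User Agent workaround can be sketched as follows. The study used Python's urllib2; this sketch uses urllib.request, its Python 3 successor, and the browser string, function names and timeout are illustrative assumptions rather than the values used in check_urls_web.py.</p>

```python
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def check_url(url, timeout=30):
    """Return (True, None) on success, or (False, reason) if any exception is raised."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout):
            return True, None
    except Exception as exc:  # HTTP error statuses such as 404 also raise here
        return False, repr(exc)
```

<p>Recording the exception text alongside the failure flag keeps the reason available for later analysis.</p>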
<p>At the end of the live web testing period, a custom program (check_urls_archived.py in Additional file <xref ref-type="supplementary-material" rid="S1">1</xref>
) was used to programmatically query the archive engines on May 23, 2011. For the Internet Archive's Wayback Machine, this was done using an HTTP HEAD request (which uses fewer resources than GET) on the URL formed by appending the URL under test to "<ext-link ext-link-type="uri" xlink:href="http://web.archive.org/web/*/">http://web.archive.org/web/*/</ext-link>
". Status was judged by the resulting HTTP status code, with 200 meaning success, 404 meaning not archived, 403 signifying a page blocked due to robots.txt, and 503 meaning that the server was too busy. Because a number of requests returned 503, the script made up to 4 attempts to access each URL, with increasing back-off delays to keep from overloading IA's servers. The final results still contained 18 such responses, which were counted as not archived for analysis. For WebCite, the documented API was used, which can return XML, a format well suited to automated parsing [<xref ref-type="bibr" rid="B23">23</xref>
]. For sites with multiple statuses, any successful archiving was taken as a success.</p>
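<p>The Wayback Machine query logic described above can be sketched in Python. This is not check_urls_archived.py: the status meanings follow the 2011 behavior described in the text and may not match the service today, the doubling delay is an assumed back-off policy, and head_status is a hypothetical caller-supplied function that performs the HEAD request and returns its status code.</p>

```python
import time

# Status meanings as described in the text (observed in 2011).
STATUS_MEANING = {
    200: "archived",
    404: "not_archived",
    403: "blocked_by_robots",
    503: "busy",
}


def wayback_query_url(url):
    """Form the query URL by appending the tested URL to the Wayback prefix."""
    return "http://web.archive.org/web/*/" + url


def query_with_backoff(url, head_status, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Query up to max_attempts times, waiting longer after each 503 response."""
    status = None
    for attempt in range(max_attempts):
        status = head_status(wayback_query_url(url))
        if status != 503:
            break
        if attempt != max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # assumed policy: double the delay each time
    return STATUS_MEANING.get(status, "unknown")


def counts_as_archived(meaning):
    """Only a successful lookup counts; a persistent 'busy' result is treated as not archived."""
    return meaning == "archived"
```

<p>Passing the request function and the sleep function as arguments keeps the retry logic testable without contacting the archive.</p>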
</sec>
</sec>
</sec>
<sec><title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec><title>Authors' contributions</title>
<p>JH implemented the tools for data acquisition and statistical analysis, performed the literature review and drafted the paper. SXG implemented an initial prototype and provided valuable feedback at every step of the process, including critical revision of this manuscript.</p>
</sec>
<sec sec-type="supplementary-material"><title>Supplementary Material</title>
<supplementary-material content-type="local-data" id="S1"><caption><title>Additional file 1</title>
<p><bold>supplement.zip</bold>
. Contains source code used to perform the study, written in Python and R. README.txt contains descriptions for each file.</p>
</caption>
<media xlink:href="1471-2105-14-S14-S5-S1.zip"><caption><p>Click here for file</p>
</caption>
</media>
</supplementary-material>
</sec>
</body>
<back><sec><title>Acknowledgements</title>
<p>The authors would like to thank the South Dakota State University departments of Mathematics & Statistics and Biology & Microbiology for their valuable feedback.</p>
</sec>
<sec><title>Declarations</title>
<p>Publication of this article was funded by the National Institutes of Health [GM083226 to SXG].</p>
<p>This article has been published as part of <italic>BMC Bioinformatics </italic>
Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at <ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14">http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14</ext-link>
.</p>
</sec>
<ref-list><ref id="B1"><mixed-citation publication-type="journal"><name><surname>Ducut</surname>
<given-names>E</given-names>
</name>
<name><surname>Liu</surname>
<given-names>F</given-names>
</name>
<name><surname>Fontelo</surname>
<given-names>P</given-names>
</name>
<article-title>An update on Uniform Resource Locator (URL) decay in MEDLINE abstracts and measures for its mitigation</article-title>
<source>BMC Med Inform Decis Mak</source>
<year>2008</year>
<volume>8</volume>
<fpage>-</fpage>
<pub-id pub-id-type="pmid">19117521</pub-id>
</mixed-citation>
</ref>
<ref id="B2"><mixed-citation publication-type="journal"><name><surname>Aronsky</surname>
<given-names>D</given-names>
</name>
<name><surname>Madani</surname>
<given-names>S</given-names>
</name>
<name><surname>Carnevale</surname>
<given-names>RJ</given-names>
</name>
<name><surname>Duda</surname>
<given-names>S</given-names>
</name>
<name><surname>Feyder</surname>
<given-names>MT</given-names>
</name>
<article-title>The prevalence and inaccessibility of Internet references in the biomedical literature at the time of publication</article-title>
<source>J Am Med Inform Assn</source>
<year>2007</year>
<volume>14</volume>
<fpage>232</fpage>
<lpage>234</lpage>
<pub-id pub-id-type="doi">10.1197/jamia.M2243</pub-id>
</mixed-citation>
</ref>
<ref id="B3"><mixed-citation publication-type="journal"><name><surname>Wren</surname>
<given-names>JD</given-names>
</name>
<article-title>URL decay in MEDLINE - a 4-year follow-up study</article-title>
<source>Bioinformatics</source>
<year>2008</year>
<volume>24</volume>
<fpage>1381</fpage>
<lpage>1385</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btn127</pub-id>
<pub-id pub-id-type="pmid">18413326</pub-id>
</mixed-citation>
</ref>
<ref id="B4"><mixed-citation publication-type="journal"><name><surname>Wren</surname>
<given-names>JD</given-names>
</name>
<article-title>404 not found: the stability and persistence of URLs published in MEDLINE</article-title>
<source>Bioinformatics</source>
<year>2004</year>
<volume>20</volume>
<fpage>668</fpage>
<lpage>U208</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg465</pub-id>
<pub-id pub-id-type="pmid">15033874</pub-id>
</mixed-citation>
</ref>
<ref id="B5"><mixed-citation publication-type="journal"><name><surname>Yang</surname>
<given-names>SL</given-names>
</name>
<name><surname>Qiu</surname>
<given-names>JP</given-names>
</name>
<name><surname>Xiong</surname>
<given-names>ZY</given-names>
</name>
<article-title>An empirical study on the utilization of web academic resources in humanities and social sciences based on web citations</article-title>
<source>Scientometrics</source>
<year>2010</year>
<volume>84</volume>
<fpage>1</fpage>
<lpage>19</lpage>
<pub-id pub-id-type="doi">10.1007/s11192-009-0142-7</pub-id>
</mixed-citation>
</ref>
<ref id="B6"><mixed-citation publication-type="other"><article-title>The Internet Archive</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.archive.org/web/web.php">http://www.archive.org/web/web.php</ext-link>
</mixed-citation>
</ref>
<ref id="B7"><mixed-citation publication-type="journal"><name><surname>Eysenbach</surname>
<given-names>G</given-names>
</name>
<name><surname>Trudell</surname>
<given-names>M</given-names>
</name>
<article-title>Going, going, still there: Using the WebCite service to permanently archive cited web pages</article-title>
<source>Journal of Medical Internet Research</source>
<year>2005</year>
<volume>7</volume>
<fpage>2</fpage>
<lpage>6</lpage>
<pub-id pub-id-type="doi">10.2196/jmir.7.1.e2</pub-id>
</mixed-citation>
</ref>
<ref id="B8"><mixed-citation publication-type="other"><article-title>The DOI System</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.doi.org/">http://www.doi.org/</ext-link>
</mixed-citation>
</ref>
<ref id="B9"><mixed-citation publication-type="other"><article-title>PURL Home Page</article-title>
<ext-link ext-link-type="uri" xlink:href="http://purl.org">http://purl.org</ext-link>
</mixed-citation>
</ref>
<ref id="B10"><mixed-citation publication-type="other"><article-title>Key Facts on Digital Object identifier System</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.doi.org/factsheets/DOIKeyFacts.html">http://www.doi.org/factsheets/DOIKeyFacts.html</ext-link>
</mixed-citation>
</ref>
<ref id="B11"><mixed-citation publication-type="journal"><name><surname>Wren</surname>
<given-names>JD</given-names>
</name>
<name><surname>Johnson</surname>
<given-names>KR</given-names>
</name>
<name><surname>Crockett</surname>
<given-names>DM</given-names>
</name>
<name><surname>Heilig</surname>
<given-names>LF</given-names>
</name>
<name><surname>Schilling</surname>
<given-names>LM</given-names>
</name>
<name><surname>Dellavalle</surname>
<given-names>RP</given-names>
</name>
<article-title>Uniform resource locator decay in dermatology journals - Author attitudes and preservation practices</article-title>
<source>Arch Dermatol</source>
<year>2006</year>
<volume>142</volume>
<fpage>1147</fpage>
<lpage>1152</lpage>
<pub-id pub-id-type="doi">10.1001/archderm.142.9.1147</pub-id>
<pub-id pub-id-type="pmid">16983002</pub-id>
</mixed-citation>
</ref>
<ref id="B12"><mixed-citation publication-type="journal"><name><surname>Casserly</surname>
<given-names>MF</given-names>
</name>
<name><surname>Bird</surname>
<given-names>JE</given-names>
</name>
<article-title>Web citation availability: Analysis and implications for scholarship</article-title>
<source>College & Research Libraries</source>
<year>2003</year>
<volume>64</volume>
<fpage>300</fpage>
<lpage>317</lpage>
<pub-id pub-id-type="pmid">23991424</pub-id>
</mixed-citation>
</ref>
<ref id="B13"><mixed-citation publication-type="other"><article-title>EZID: Pricing</article-title>
<ext-link ext-link-type="uri" xlink:href="http://n2t.net/ezid/home/pricing">http://n2t.net/ezid/home/pricing</ext-link>
</mixed-citation>
</ref>
<ref id="B14"><mixed-citation publication-type="journal"><name><surname>Wagner</surname>
<given-names>C</given-names>
</name>
<name><surname>Gebremichael</surname>
<given-names>MD</given-names>
</name>
<name><surname>Taylor</surname>
<given-names>MK</given-names>
</name>
<name><surname>Soltys</surname>
<given-names>MJ</given-names>
</name>
<article-title>Disappearing act: decay of uniform resource locators in health care management journals</article-title>
<source>J Med Libr Assoc</source>
<year>2009</year>
<volume>97</volume>
<fpage>122</fpage>
<lpage>130</lpage>
<pub-id pub-id-type="doi">10.3163/1536-5050.97.2.009</pub-id>
<pub-id pub-id-type="pmid">19404503</pub-id>
</mixed-citation>
</ref>
<ref id="B15"><mixed-citation publication-type="journal"><name><surname>Koehler</surname>
<given-names>W</given-names>
</name>
<article-title>An analysis of Web page and Web site constancy and permanence</article-title>
<source>J Am Soc Inf Sci</source>
<year>1999</year>
<volume>50</volume>
<fpage>162</fpage>
<lpage>180</lpage>
<pub-id pub-id-type="doi">10.1002/(SICI)1097-4571(1999)50:2<162::AID-ASI7>3.0.CO;2-B</pub-id>
</mixed-citation>
</ref>
<ref id="B16"><mixed-citation publication-type="journal"><name><surname>Bar-Ilan</surname>
<given-names>J</given-names>
</name>
<name><surname>Peritz</surname>
<given-names>BC</given-names>
</name>
<article-title>Evolution, continuity, and disappearance of documents on a specific topic on the web: A longitudinal study of "informetrics"</article-title>
<source>Journal of the American Society for Information Science and Technology</source>
<year>2004</year>
<volume>55</volume>
<fpage>980</fpage>
<lpage>990</lpage>
<pub-id pub-id-type="doi">10.1002/asi.20049</pub-id>
</mixed-citation>
</ref>
<ref id="B17"><mixed-citation publication-type="journal"><name><surname>Koehler</surname>
<given-names>W</given-names>
</name>
<article-title>A longitudinal study of Web pages continued: a consideration of document persistence</article-title>
<source>Information Research-an International Electronic Journal</source>
<year>2004</year>
<volume>9</volume>
<fpage>-</fpage>
</mixed-citation>
</ref>
<ref id="B18"><mixed-citation publication-type="journal"><name><surname>Casserly</surname>
<given-names>MF</given-names>
</name>
<name><surname>Bird</surname>
<given-names>JE</given-names>
</name>
<article-title>Web citation availability - A follow-up study</article-title>
<source>Libr Resour Tech Ser</source>
<year>2008</year>
<volume>52</volume>
<fpage>42</fpage>
<lpage>53</lpage>
<pub-id pub-id-type="doi">10.5860/lrts.52n1.42</pub-id>
</mixed-citation>
</ref>
<ref id="B19"><mixed-citation publication-type="journal"><name><surname>Peng</surname>
<given-names>RD</given-names>
</name>
<article-title>Reproducible research and Biostatistics</article-title>
<source>Biostatistics</source>
<year>2009</year>
<volume>10</volume>
<fpage>405</fpage>
<lpage>408</lpage>
<pub-id pub-id-type="doi">10.1093/biostatistics/kxp014</pub-id>
<pub-id pub-id-type="pmid">19535325</pub-id>
</mixed-citation>
</ref>
<ref id="B20"><mixed-citation publication-type="journal"><name><surname>Ince</surname>
<given-names>DC</given-names>
</name>
<name><surname>Hatton</surname>
<given-names>L</given-names>
</name>
<name><surname>Graham-Cumming</surname>
<given-names>J</given-names>
</name>
<article-title>The case for open computer programs</article-title>
<source>Nature</source>
<year>2012</year>
<volume>482</volume>
<fpage>485</fpage>
<lpage>488</lpage>
<pub-id pub-id-type="doi">10.1038/nature10836</pub-id>
<pub-id pub-id-type="pmid">22358837</pub-id>
</mixed-citation>
</ref>
<ref id="B21"><mixed-citation publication-type="other"><collab>R Development Core Team</collab>
<article-title>R: A Language and Environment for Statistical Computing</article-title>
<source>R: A Language and Environment for Statistical Computing</source>
<year>2011</year>
<comment>Vienna, Austria: R Foundation for Statistical Computing</comment>
</mixed-citation>
</ref>
<ref id="B22"><mixed-citation publication-type="book"><name><surname>Therneau</surname>
<given-names>T</given-names>
</name>
<article-title>A Package for Survival Analysis in S</article-title>
<source>A Package for Survival Analysis in S</source>
<year>2012</year>
<edition>2.36-12</edition>
</mixed-citation>
</ref>
<ref id="B23"><mixed-citation publication-type="other"><article-title>WebCite Technical Background and Best Practices Guide</article-title>
<ext-link ext-link-type="uri" xlink:href="http://www.webcitation.org/doc/WebCiteBestPracticesGuide.pdf">http://www.webcitation.org/doc/WebCiteBestPracticesGuide.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="B24"><mixed-citation publication-type="journal"><name><surname>Markwell</surname>
<given-names>J</given-names>
</name>
<name><surname>Brooks</surname>
<given-names>DW</given-names>
</name>
<article-title>"Link rot" limits the usefulness of web-based educational materials in biochemistry and molecular biology</article-title>
<source>Biochemistry and Molecular Biology Education</source>
<year>2003</year>
<volume>31</volume>
<fpage>69</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="doi">10.1002/bmb.2003.494031010165</pub-id>
</mixed-citation>
</ref>
<ref id="B25"><mixed-citation publication-type="journal"><name><surname>Thorp</surname>
<given-names>AW</given-names>
</name>
<name><surname>Brown</surname>
<given-names>L</given-names>
</name>
<article-title>Accessibility of internet references in Annals of Emergency Medicine: Is it time to require archiving?</article-title>
<source>Ann Emerg Med</source>
<year>2007</year>
<volume>50</volume>
<fpage>188</fpage>
<lpage>192</lpage>
<pub-id pub-id-type="doi">10.1016/j.annemergmed.2006.11.019</pub-id>
<pub-id pub-id-type="pmid">17276549</pub-id>
</mixed-citation>
</ref>
<ref id="B26"><mixed-citation publication-type="journal"><name><surname>Carnevale</surname>
<given-names>RJ</given-names>
</name>
<name><surname>Aronsky</surname>
<given-names>D</given-names>
</name>
<article-title>The life and death of URLs in five biomedical informatics journals</article-title>
<source>International Journal of Medical Informatics</source>
<year>2007</year>
<volume>76</volume>
<fpage>269</fpage>
<lpage>273</lpage>
<pub-id pub-id-type="doi">10.1016/j.ijmedinf.2005.12.001</pub-id>
<pub-id pub-id-type="pmid">16458066</pub-id>
</mixed-citation>
</ref>
<ref id="B27"><mixed-citation publication-type="journal"><name><surname>Dimitrova</surname>
<given-names>DV</given-names>
</name>
<name><surname>Bugeja</surname>
<given-names>M</given-names>
</name>
<article-title>Consider the source: Predictors of online citation permanence in communication journals</article-title>
<source>Portal-Libraries and the Academy</source>
<year>2006</year>
<volume>6</volume>
<fpage>269</fpage>
<lpage>283</lpage>
<pub-id pub-id-type="doi">10.1353/pla.2006.0032</pub-id>
</mixed-citation>
</ref>
<ref id="B28"><mixed-citation publication-type="journal"><name><surname>Duda</surname>
<given-names>JJ</given-names>
</name>
<name><surname>Camp</surname>
<given-names>RJ</given-names>
</name>
<article-title>Ecology in the information age: patterns of use and attrition rates of internet-based citations in ESA journals, 1997-2005</article-title>
<source>Frontiers in Ecology and the Environment</source>
<year>2008</year>
<volume>6</volume>
<fpage>145</fpage>
<lpage>151</lpage>
<pub-id pub-id-type="doi">10.1890/070022</pub-id>
</mixed-citation>
</ref>
<ref id="B29"><mixed-citation publication-type="journal"><name><surname>Rhodes</surname>
<given-names>S</given-names>
</name>
<article-title>Breaking Down Link Rot: The Chesapeake Project Legal Information Archive's Examination of URL Stability</article-title>
<source>Law Library Journal</source>
<year>2010</year>
<volume>102</volume>
<fpage>581</fpage>
<lpage>597</lpage>
</mixed-citation>
</ref>
<ref id="B30"><mixed-citation publication-type="journal"><name><surname>Goh</surname>
<given-names>DHL</given-names>
</name>
<name><surname>Ng</surname>
<given-names>PK</given-names>
</name>
<article-title>Link decay in leading information science journals</article-title>
<source>Journal of the American Society for Information Science and Technology</source>
<year>2007</year>
<volume>58</volume>
<fpage>15</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="doi">10.1002/asi.20513</pub-id>
</mixed-citation>
</ref>
<ref id="B31"><mixed-citation publication-type="journal"><name><surname>Russell</surname>
<given-names>E</given-names>
</name>
<name><surname>Kane</surname>
<given-names>J</given-names>
</name>
<article-title>The missing link - Assessing the reliability of Internet citations in history journals</article-title>
<source>Technology and Culture</source>
<year>2008</year>
<volume>49</volume>
<fpage>420</fpage>
<lpage>429</lpage>
<pub-id pub-id-type="doi">10.1353/tech.0.0028</pub-id>
</mixed-citation>
</ref>
<ref id="B32"><mixed-citation publication-type="journal"><name><surname>Dellavalle</surname>
<given-names>RP</given-names>
</name>
<name><surname>Hester</surname>
<given-names>EJ</given-names>
</name>
<name><surname>Heilig</surname>
<given-names>LF</given-names>
</name>
<name><surname>Drake</surname>
<given-names>AL</given-names>
</name>
<name><surname>Kuntzman</surname>
<given-names>JW</given-names>
</name>
<name><surname>Graber</surname>
<given-names>M</given-names>
</name>
<name><surname>Schilling</surname>
<given-names>LM</given-names>
</name>
<article-title>Information science - Going, going, gone: Lost Internet references</article-title>
<source>Science</source>
<year>2003</year>
<volume>302</volume>
<fpage>787</fpage>
<lpage>788</lpage>
<pub-id pub-id-type="doi">10.1126/science.1088234</pub-id>
<pub-id pub-id-type="pmid">14593153</pub-id>
</mixed-citation>
</ref>
<ref id="B33"><mixed-citation publication-type="journal"><name><surname>Evangelou</surname>
<given-names>E</given-names>
</name>
<name><surname>Trikalinos</surname>
<given-names>TA</given-names>
</name>
<name><surname>Ioannidis</surname>
<given-names>JPA</given-names>
</name>
<article-title>Unavailability of online supplementary scientific information from articles published in major journals</article-title>
<source>Faseb Journal</source>
<year>2005</year>
<volume>19</volume>
<fpage>1943</fpage>
<lpage>1944</lpage>
<pub-id pub-id-type="doi">10.1096/fj.05-4784lsf</pub-id>
<pub-id pub-id-type="pmid">16319137</pub-id>
</mixed-citation>
</ref>
<ref id="B34"><mixed-citation publication-type="journal"><name><surname>Sellitto</surname>
<given-names>C</given-names>
</name>
<article-title>The impact of impermanent web-located citations: A study of 123 scholarly conference publications</article-title>
<source>Journal of the American Society for Information Science and Technology</source>
<year>2005</year>
<volume>56</volume>
<fpage>695</fpage>
<lpage>703</lpage>
<pub-id pub-id-type="doi">10.1002/asi.20159</pub-id>
</mixed-citation>
</ref>
<ref id="B35"><mixed-citation publication-type="journal"><name><surname>Bar-Ilan</surname>
<given-names>J</given-names>
</name>
<name><surname>Peritz</surname>
<given-names>B</given-names>
</name>
<article-title>The lifespan of "informetrics" on the Web: An eight year study (1998-2006)</article-title>
<source>Scientometrics</source>
<year>2009</year>
<volume>79</volume>
<fpage>7</fpage>
<lpage>25</lpage>
<pub-id pub-id-type="doi">10.1007/s11192-009-0401-7</pub-id>
</mixed-citation>
</ref>
<ref id="B36"><mixed-citation publication-type="other"><name><surname>Gomes</surname>
<given-names>D</given-names>
</name>
<name><surname>Silva</surname>
<given-names>MJ</given-names>
</name>
<article-title>Modelling Information Persistence on the Web</article-title>
<source>Modelling Information Persistence on the Web</source>
<year>2006</year>
</mixed-citation>
</ref>
<ref id="B37"><mixed-citation publication-type="journal"><name><surname>Markwell</surname>
<given-names>J</given-names>
</name>
<name><surname>Brooks</surname>
<given-names>DW</given-names>
</name>
<article-title>Evaluating web-based information: Access and accuracy</article-title>
<source>Journal of Chemical Education</source>
<year>2008</year>
<volume>85</volume>
<fpage>458</fpage>
<lpage>459</lpage>
<pub-id pub-id-type="doi">10.1021/ed085p458</pub-id>
</mixed-citation>
</ref>
<ref id="B38"><mixed-citation publication-type="journal"><name><surname>Wu</surname>
<given-names>ZQ</given-names>
</name>
<article-title>An empirical study of the accessibility of web references in two Chinese academic journals</article-title>
<source>Scientometrics</source>
<year>2009</year>
<volume>78</volume>
<fpage>481</fpage>
<lpage>503</lpage>
<pub-id pub-id-type="doi">10.1007/s11192-007-1951-1</pub-id>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Dakota du Sud</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Dakota du Sud"><name sortKey="Hennessey, Jason" sort="Hennessey, Jason" uniqKey="Hennessey J" first="Jason" last="Hennessey">Jason Hennessey</name>
</region>
<name sortKey="Ge, Steven Xijin" sort="Ge, Steven Xijin" uniqKey="Ge S" first="Steven Xijin" last="Ge">Steven Xijin Ge</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Ncbi/Merge

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000176 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd -nk 000176 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Ncbi
   |étape=   Merge
   |type=    RBID
   |clé=     PMC:3851533
   |texte=   A cross disciplinary study of link decay and the effectiveness of mitigation techniques
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/RBID.i   -Sk "pubmed:24266891" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024
