Link rot: Difference between revisions

Revision as of 07:57, 14 September 2019

Link rot (or linkrot) is the process by which hyperlinks on individual websites or the Internet in general tend to point to web pages, servers or other resources that have become permanently unavailable.^[1] There is no reliable data on how long web pages and other resources survive: the estimates vary dramatically between different studies, as well as between different sets of links on which these studies are based (see the Prevalence section).

Terminology

Link rot is also called "link death", "link breaking", or "reference rot". A link that does not work any more is called a "broken link", "dead link", or "dangling link". Formally, this is a form of dangling reference: the target of the reference no longer exists.

Causes

One of the most common reasons for a broken link is that the target web page no longer exists. This frequently results in a 404 error, which indicates that the web server responded but the specific page could not be found. Another type of dead link occurs when the server that hosts the target page stops working, is removed from service, or relocates to a new domain name. The browser may return a DNS error or display a site unrelated to the content sought. The latter can occur when a domain name lapses and is subsequently registered by another party. Other reasons for broken links include:

Websites can be restructured or redesigned, or the underlying technology can be changed, altering or invalidating large numbers of inbound or internal links.
Many news sites keep articles freely accessible for only a short time period, and then move them behind a paywall. This causes a significant loss of supporting links in sites discussing news events and using media sites as references.
Content may automatically expire after a certain time period.
Content may be intentionally removed by the owner.
A server may be upgraded and code (e.g. PHP) may no longer function properly.
Links may be removed as a result of legal action or court order.
Search results from social media such as Facebook and Tumblr are prone to link rot because of frequent changes in user privacy, the deletion of accounts, search results pointing to a dynamic page that has new results that differ from the cached result, or the deletion of links or photos.
Links can contain ephemeral, user-specific information such as session or login data. Because these are not universally valid, the result can be a broken link.
A link might be broken because of some form of blocking such as content filters or firewalls.
A website may be closed or taken down, invalidating the links which are pointing to it.
A website might change its domain name. Links pointing to the old name might then become invalid.
Dead links can occur on the authoring side, when website content is assembled from Internet sources and deployed without properly verifying the link targets.
As new private gTLDs became popular, some top-level domains such as .mcdonalds or .xperia were later revoked.^[2]

Prevalence

The 404 "Not Found" response is so ubiquitous that it is familiar to even the occasional web user. A number of studies have examined the prevalence of link rot on the web, in academic literature, and in digital libraries.^[3] A 2003 experiment found that about one link out of every 200 disappeared from the Internet each week.^[4] A 2005 study found that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication,^[5] and other studies have shown link rot in academic literature to be even worse.^[6]^[7] A 2002 study of link rot in digital libraries found that about 3% of the objects were no longer accessible after one year.^[8] In 2014, Maciej Cegłowski, owner of the bookmarking site Pinboard, reported a "pretty steady rate" of 5% link rot per year.^[9] A study of the links in the Yahoo! directory, carried out in 2016–2017 (shortly after Yahoo! stopped publishing the directory), found that the half-life of the links was around two years.^[10]

Some research from the early years of the World Wide Web (late 1990s – early 2000s) found half-lives that differed by more than an order of magnitude between different collections of links.^[11]

A 2014 Harvard Law School study by Jonathan Zittrain, Kendra Albert and Lawrence Lessig determined that approximately 50% of the URLs in U.S. Supreme Court opinions no longer link to the original information.^[1] They also found that in a selection of legal journals published between 1999 and 2011, more than 70% of the links no longer functioned as intended. A 2013 study in BMC Bioinformatics analyzed nearly 15,000 links in abstracts from Thomson Reuters's Web of Science citation index and found that the median lifespan of web pages was 9.3 years, and that just 62% were archived.^[12] In August 2015, Weblock analyzed more than 180,000 links from references in the full-text corpora of three major open access publishers and found that overall 24.5% of the links cited were no longer available.^[13]
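The figures in this section are reported in different units (links lost per week, percent per year, half-life), which makes them hard to compare directly. The sketch below converts a per-period loss rate into a half-life. It assumes that links disappear at a constant rate (exponential decay), which is a simplifying assumption made here for illustration and not a claim made by the cited studies.

```python
import math


def half_life(loss_per_period):
    """Number of periods until half the links are gone.

    Assumes a constant fractional loss per period (exponential decay).
    """
    return math.log(0.5) / math.log(1.0 - loss_per_period)


# 2003 experiment: about one link in 200 lost per week
weekly_periods = half_life(1 / 200)
print(f"1 in 200 per week -> half-life of {weekly_periods / 52:.1f} years")  # 2.7 years

# Pinboard, 2014: about 5% lost per year
print(f"5% per year       -> half-life of {half_life(0.05):.1f} years")      # 13.5 years
```

Under this assumption the two rates imply half-lives of roughly 2.7 and 13.5 years, which illustrates how far estimates drawn from different link collections can diverge.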

Discovering

Broken links can be discovered manually or automatically. Automated methods, including plug-ins for WordPress, Drupal and other content management systems, can detect broken URLs. An alternative is a dedicated broken link checker such as Xenu's Link Sleuth. However, a URL that returns an HTTP 200 (OK) response may be accessible even though the contents of the page have changed and are no longer relevant, so automated checks cannot fully replace manual review. Some web servers also return a soft 404: they report to clients that the link works (typically with a 200 response) even though the requested page no longer exists. A study published in 2004 developed a heuristic for detecting soft 404s.^[14]
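As a rough illustration of automated checking, the sketch below classifies a URL as unreachable, broken, possibly a soft 404, or apparently fine. The soft-404 test probes a random sibling URL that almost certainly does not exist: if the server also answers that request successfully, a 200 for the original URL proves little. This is only loosely inspired by the idea of probing with random URLs and is not the heuristic published in the 2004 study; the function names are hypothetical, and some servers reject HEAD requests, in which case a GET would be needed.

```python
import random
import string
import urllib.error
import urllib.request
from urllib.parse import urljoin


def fetch_status(url, timeout=10):
    """Return the final HTTP status for url, or None if the host cannot be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error, e.g. 404
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, or timeout


def random_probe(url, length=24):
    """Build a sibling URL on the same site that almost certainly does not exist."""
    junk = "".join(random.choices(string.ascii_lowercase, k=length))
    return urljoin(url, junk)


def classify(url, fetch=fetch_status):
    """Classify url; `fetch` is injectable so the logic can be tested offline."""
    status = fetch(url)
    if status is None:
        return "unreachable"
    if status >= 400:
        return "broken"
    # A nonsense URL that also "works" suggests the server hides its 404s.
    probe_status = fetch(random_probe(url))
    if probe_status is not None and probe_status < 400:
        return "possible soft 404"
    return "ok"
```

Even an "ok" result only means the server answered normally; it says nothing about whether the page still contains the content that was originally cited.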

Combating

There are numerous solutions for tackling broken links: some work to prevent them in the first place, while others try to resolve them once they have occurred. Numerous tools have also been developed to help combat link rot.

Authoring

Recommendations, some of which are common-sense, include:

Carefully select and implement hyperlinks, and verify them regularly after publication. Best practices include linking to primary rather than secondary sources and prioritizing stable sites.^{[citation needed]} Authors of one study suggest avoiding URL citations that point to resources on researchers' personal pages.^[5]
Always look for the most compact and direct URL available, and ensure that it's clean, with no unnecessary information after the core of the URL.^[15] This process is often referred to as URL normalization or URL canonicalization.
Whenever possible, use persistent identifiers (URLs designed for durability) such as ARKs, DOIs, Handle System references, and PURLs.
Avoid linking to PDF documents if possible. Because PDFs are documents rather than web pages, their content can change without notice, and their names are more likely to contain characters such as spaces that must be translated into safe codes for URLs. Large PDFs may also download slowly and cause a timeout error.^[15]
Avoid linking to pages deep in a website, a practice known as deep linking.
Use web archiving services (for example, WebCite) to permanently archive and retrieve cited Internet references.^[16]
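The advice about compact, clean URLs can be sketched in code. The example below lowercases the host, drops the fragment, and removes query parameters that are commonly used for tracking or session state. The lists of parameter names are assumptions chosen for illustration; a real site would need its own list, and on some sites the fragment or a session-like parameter is needed to reach the content.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative assumptions: parameters that identify a visit, not a resource.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "sessionid"}


def clean_url(url):
    """Return a shorter form of url with tracking data and fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


print(clean_url("HTTPS://Example.COM/article?id=7&utm_source=feed&fbclid=abc#comments"))
# https://example.com/article?id=7
```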

Server side

Never change URLs and never remove pages. If there is a reason to no longer have a page, such as a news site redacting a story, replace it with a message explaining its removal.
When URLs change, use redirection mechanisms such as "301: Moved Permanently" to automatically refer browsers and crawlers to the new location.
Content management systems may offer built-in solutions to the management of links, such as updating them when content is changed or moved on a site.
WordPress guards against link rot by replacing non-canonical URLs with their canonical versions.^[17]
IBM's Peridot attempts to automatically fix broken links.
Permalinking stops broken links by guaranteeing that the content will not move for the foreseeable future. Another form of permalinking is linking to a permalink that then redirects to the actual content, so that links pointing to the resource stay intact even if the real content is moved.
Design URLs – for example, semantic URLs – such that they won't need to change when a different person takes over maintenance of a document or when different software is used on the server.^[18]
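The redirection advice above can be illustrated with a minimal sketch using Python's standard `http.server` module. The paths in `MOVED` are hypothetical, and a production site would normally configure redirects in its web server or content management system instead of running a handler like this one.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from retired paths to their new locations.
MOVED = {
    "/old-report.html": "/reports/2019/annual",
    "/news.php?id=42": "/news/42",
}


def resolve(path):
    """Return the new location for a retired path, or None if there is none."""
    return MOVED.get(path)


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = resolve(self.path)  # self.path includes any query string
        if target is not None:
            # 301 tells browsers and crawlers that the move is permanent.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404)


def serve(port=8000):
    """Serve redirects on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Calling `serve()` and requesting `/old-report.html` should produce a 301 response whose `Location` header points at the new path, so existing inbound links keep working.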

User side

The Linkgraph widget gets the URL of the correct page based upon the old broken URL by using historical location information.
The Google 404 Widget attempts to "guess" the correct URL, and also provides the user with a search box to find the correct page.
When a user receives a 404 response, the Google Toolbar attempts to assist the user in finding the missing page.^[19]
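One simple way a 404 page could guess the intended page is fuzzy string matching against the paths the site knows about. The sketch below uses `difflib.get_close_matches` from the Python standard library. It is an illustration of the general idea only and is not how the tools listed above work; the list of known paths is hypothetical.

```python
from difflib import get_close_matches

# Hypothetical list of paths that currently exist on the site.
KNOWN_PATHS = ["/about", "/contact", "/blog/link-rot", "/blog/web-archiving"]


def guess_path(requested, known=KNOWN_PATHS, cutoff=0.6):
    """Return the known path most similar to the requested one, or None."""
    matches = get_close_matches(requested, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(guess_path("/abuot"))  # /about
```

A guess of this kind only repairs typos and small renames; it cannot recover a page whose path has changed completely, which is where historical location data or a search box helps.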

Web archiving

To combat link rot, web archivists are actively engaged in collecting the Web or particular portions of the Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. The goal of the Internet Archive is to maintain an archive of the entire Web, taking periodic snapshots of pages that can then be accessed for free via the Wayback Machine. In January 2013 the organization announced that it had reached the milestone of 240 billion archived URLs.^[20] National libraries, national archives and other organizations are also involved in archiving culturally important Web content.

Individuals may use a number of tools that allow them to archive web resources that may go missing in the future:

The "Wayback Machine", at the Internet Archive,^[21] is a free website that archives old web pages. It does not archive websites whose owners have stated they do not want their website archived.
WebCite, a tool specifically for scholarly authors, journal editors and publishers to permanently archive "on-demand" and retrieve cited Internet references.^[16]
Archive.is, an archive site which stores snapshots of web pages. It retrieves one page at a time, but unlike WebCite, it includes Web 2.0 sites such as Google Maps and Twitter.
Perma.cc, which is supported by the Harvard Law School together with a broad coalition of university libraries, takes a snapshot of a URL's content and returns a permanent link.^[1]
The Hiberlink project, a collaboration between the University of Edinburgh, the Los Alamos National Laboratory and others, is working to measure “reference rot” in online academic articles, and also to what extent Web content has been archived.^[22] A related project, Memento, has established a technical standard for accessing online content as it existed in the past.^[23]
Some social bookmarking websites allow users to make online clones of any web page on the internet, creating a copy at an independent URL which remains online even if the original page goes down.
Amber, created by the Harvard Berkman Center, is a tool built to fight link rot through archiving links on WordPress and Drupal sites to prevent web censorship and bolster content preservation.^[24]

However, such preservation systems may themselves suffer intermittent service interruptions, during which the preserved URLs are temporarily unavailable.^[25]
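As an example of using an archive programmatically, the sketch below asks the Wayback Machine's availability endpoint for the snapshot closest to a given date. The endpoint URL and the shape of the JSON response are written from the Internet Archive's public documentation as understood here and should be checked against the current documentation before use; the function names are hypothetical. Because the service can be intermittently unavailable, callers should expect network errors.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine availability API.
API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL; timestamp is an optional string such as '20190914'."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Extract the archived URL from a decoded response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(url, timestamp=None, timeout=10):
    """Query the service over the network and return the snapshot URL, if any."""
    query = availability_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

A link checker could call `lookup()` for each dead link it finds and offer the archived copy as a replacement, passing the date on which the page was originally cited as the timestamp.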

References

^ ^a ^b ^c Zittrain, Jonathan; Albert, Kendra; Lessig, Lawrence (12 June 2014). "Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations". Legal Information Management. 14 (2): 88–99. doi:10.1017/S1472669614000255.
^ "The death of a TLD". blog.benjojo.co.uk. Archived from the original on 2018-07-26. Retrieved 2018-07-27.
^ Habibzadeh, P. (2013). "Decay of References to Web sites in Articles Published in General Medical Journals: Mainstream vs Small Journals". Applied Clinical Informatics. 4 (4): 455–464. doi:10.4338/aci-2013-07-ra-0055. PMC 3885908. PMID 24454575.
^ Fetterly, Dennis; Manasse, Mark; Najork, Marc; Wiener, Janet (2003). "A large-scale study of the evolution of web pages". Proceedings of the 12th international conference on World Wide Web. Retrieved 14 September 2010.
^ ^a ^b McCown, Frank; Chan, Sheffan; Nelson, Michael L.; Bollen, Johan (2005). "The Availability and Persistence of Web References in D-Lib Magazine" (PDF). Proceedings of the 5th International Web Archiving Workshop and Digital Preservation (IWAW'05).
^ Spinellis, Diomidis (2003). "The Decay and Failures of Web References". Communications of the ACM. 46 (1): 71–77. CiteSeerX 10.1.1.12.9599. doi:10.1145/602421.602422.
^ Lawrence, Steve; Pennock, David M.; Flake, Gary William; Krovetz, Robert; Coetzee, Frans M.; Glover, Eric; Nielsen, Finn Arup; Kruger, Andries; Giles, C. Lee (2001). "Persistence of Web References in Scientific Research". Computer. 34 (2): 26–31. CiteSeerX 10.1.1.97.9695. doi:10.1109/2.901164.
^ Nelson, Michael L.; Allen, B. Danette (2002). "Object Persistence and Availability in Digital Libraries". D-Lib Magazine. 8 (1). doi:10.1045/january2002-nelson.
^ Cegłowski, Maciej (9 September 2014). "Web Design: The First 100 Years". Archived from the original on 22 July 2015. Retrieved 22 July 2015.
^ van der Graaf, Hans. "The half-life of a link is two year". ZOMDir's blog. Archived from the original on 2017-10-17. Retrieved 2019-01-31.
^ Koehler, Wallace (2004). "A longitudinal study of web pages continued: a consideration of document persistence". Information Research. 9 (2). Archived from the original on 2017-09-11. Retrieved 2019-01-31.
^ Hennessey, Jason; Xijin Ge, Steven (2013). "A Cross Disciplinary Study of Link Decay and the Effectiveness of Mitigation Techniques". BMC Bioinformatics. 14: S5. doi:10.1186/1471-2105-14-S14-S5. PMC 3851533. PMID 24266891. Archived from the original on 21 January 2015. Retrieved 16 January 2015.
^ "All-Time Weblock Report". August 2015. Archived from the original on 4 March 2016. Retrieved 12 January 2016.
^ Bar-Yossef, Ziv; Broder, Andrei Z.; Kumar, Ravi; Tomkins, Andrew (2004). "Sic transit gloria telae: towards an understanding of the Web's decay". Proceedings of the 13th international conference on World Wide Web – WWW '04. pp. 328–337. CiteSeerX 10.1.1.1.9406. doi:10.1145/988672.988716. ISBN 978-1581138443.
^ ^a ^b Kille, Leighton Walter (8 November 2014). "The Growing Problem of Internet "Link Rot" and Best Practices for Media and Online Publishers". Journalist's Resource, Harvard Kennedy School. Archived from the original on 12 January 2015. Retrieved 16 January 2015.
^ ^a ^b Eysenbach, Gunther; Trudel, Mathieu (2005). "Going, going, still there: Using the WebCite service to permanently archive cited web pages". Journal of Medical Internet Research. 7 (5): e60. doi:10.2196/jmir.7.5.e60. PMC 1550686. PMID 16403724.
^ Rønn-Jensen, Jesper (2007-10-05). "Software Eliminates User Errors And Linkrot". Justaddwater.dk. Archived from the original on 11 October 2007. Retrieved 5 October 2007.
^ Berners-Lee, Tim (1998). "Cool URIs Don't Change". Archived from the original on 2000-03-02. Retrieved 2019-01-31.
^ Mueller, John (2007-12-14). "FYI on Google Toolbar's Latest Features". Google Webmaster Central Blog. Archived from the original on 13 September 2008. Retrieved 9 July 2008.
^ "Wayback Machine: Now with 240,000,000,000 URLs | Internet Archive Blogs". 2013-01-09. Archived from the original on 2017-09-12. Retrieved 2014-04-16.
^ "Internet Archive: Digital Library of Free Books, Movies, Music & Wayback Machine". 2001-03-10. Archived from the original on 26 January 1997. Retrieved 7 October 2013.
^ "Hiberlink". Hiberlink.org. Archived from the original on 29 January 2015. Retrieved 15 January 2015.
^ "Memento: Time Travel for the Web". Memento. Archived from the original on 7 January 2015. Retrieved 15 January 2015.
^ "Harvard University's Berkman Center Releases Amber, a "Mutual Aid" Tool for Bloggers & Website Owners to Help Keep the Web Available | Berkman Center". cyber.law.harvard.edu. Archived from the original on 2016-02-02. Retrieved 2016-01-28.
^ Habibzadeh, Parham (2015-07-30). "Are current archiving systems reliable enough?". International Urogynecology Journal. 26 (10): 1553. doi:10.1007/s00192-015-2805-7. ISSN 0937-3462. PMID 26224384.

External links

Future-Proofing Your URIs
Jakob Nielsen, "Fighting Linkrot", Jakob Nielsen's Alertbox, June 14, 1998.

@@ Line 3: / Line 3: @@
 ==Terminology==
-Link rot is also called "link death", "link breaking", or "reference rot". A link that does not work any more is called a "broken link", "dead link", or "dangling link". Formally, this is a form of [[dangling reference]]: the target of the reference no longer exists.
+Link rot is also called "link death", "link breaking", or "reference rot". A link that does not work any more is called a "[https://app.neilpatel.com/en/traffic_analyzer/overview?domain=freepromocodeoffers.com&locId=2356&lang=en&type=organic broken link]", "dead link", or "dangling link". Formally, this is a form of [[dangling reference]]: the target of the reference no longer exists.
 ==Causes==