User:Dispenser/Checklinks

Checklinks
Original author(s): Dispenser
Initial release: July 15, 2007
Operating system: Client: any Web browser
Platform: web, pywikipedia
Type: Link checker, Wikipedia tool
Website: toolserver.org/~dispenser/view/Checklinks

Checklinks is a tool on the Toolserver that checks external links for Wikimedia Foundation wikis. It parses a page, queries all external links and classifies them using heuristics. The tool runs in two modes: an on-the-fly mode that gives instant results for individual pages, and a project scan mode that produces reports for interested WikiProjects.

The tool is typically used in one of two ways: as a link auditor in the article review processes, to make sure the links are working, or as a link manager, where links can be reviewed, replaced with a working or archived link, given citation information, tagged, or removed.

Background

Link rot is a bigger problem for the English Wikipedia than for most other websites, because its sourcing practices rely on external links to references. Some dead links are caused by content being moved around without proper redirection, others require micropayments after a certain time period, and others simply vanish. With nearly a hundred links in an article, it becomes an ordeal to ensure that all the links and references are working correctly, even in the featured articles that appear on the main page.

Some Wikipedians had already built scripts to scan for dead links. There are giant aging lists like those at Wikipedia:Dead external links, which was last updated in late 2006. However, repairing the dead links those scripts found took a lot of manual work: checking whether the link was still there, searching for a replacement, and editing the link. Much of this is a repetitive and inefficient use of a human's time. This tool attempts to increase efficiency as much as possible by combining the most used features into its interface.

Running

Type or paste a URL, a page title, or a wikilink into the input box. All major languages and projects are supported. MediaWiki search functionality is not supported in this interface at this time.
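
As an illustration of accepting those three input forms, the following is a minimal Python sketch, not the tool's actual code. The function name `normalize_input` and the assumption that article URLs use a `/wiki/` path are hypothetical, and the sketch drops the language and project information that the real tool would need to keep.

```python
from urllib.parse import urlsplit, unquote


def normalize_input(text):
    """Reduce a wikilink, an article URL, or a bare title to a page title.

    Hypothetical helper: assumes article URLs look like .../wiki/<Title>.
    """
    text = text.strip()
    if text.startswith("[[") and text.endswith("]]"):
        # [[Title|label]] -> Title
        text = text[2:-2].split("|", 1)[0]
    elif text.startswith(("http://", "https://")):
        path = urlsplit(text).path
        if not path.startswith("/wiki/"):
            raise ValueError("unrecognised article URL: " + text)
        # Undo percent-encoding in the title part of the path
        text = unquote(path[len("/wiki/"):])
    # Underscores in URLs and links stand for spaces in titles
    return text.replace("_", " ").strip()
```

For example, `normalize_input("[[Jimmy Wales|Jimbo]]")` and `normalize_input("https://en.wikipedia.org/wiki/Jimmy_Wales")` both give `Jimmy Wales`.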

Interface

Tools ▼ | Save changes | Jimmy Wales

Ref | External link | HTTP | Analysis
37 | Wikipedia Founder Edits Own Bio (info) [wired.com]; accessdate=2006-02-14; publisher=Wired; work=Wired News | 302 | Changes to date style path
41 | In Search of an Online Utopia[dead link] (info) [msn.com] | 302 | Changes domain and redirect to /

Page heading
  1. Name of the article for the set of links below
  2. Other tools which allow checking of contributions and basic page information.
  3. "Save changes" is used after setting actions using the drop down.
Link
  1. Reference number
  2. The external link. May contain information extracted from {{cite web}} or {{citation}}.
  3. HTTP status code; tooltip contains the reason as stated by the web server.
  4. Analysis information. In this example the tool determined that one of the 302 redirects was likely a dead link.

Classifications

Identifier | Rank | Meaning | Action
Working (White) | 0 | The link appears to work. | No action necessary.
Message (Green) | 1 | An HTTP move (redirect) has occurred. | The link should work but should be checked. If the server responded with HTTP 301, consider updating it.
Warn (Yellow) | 2 | The link could pose a problem to users. This includes expiring news sources, subscription-only content, and a low signal-to-noise ratio of text to links. | If the link is expiring, ensure that all critical details are filled in so that someone can find an offline copy.
Heuristically determined (Orange) | 4 | The tool thinks that the link is dead: a 404 in the redirect chain, or a redirect to the root (/) of the website. | Check the link. If it is dead, try to use archiveurl with an archived copy from the Internet Archive; otherwise tag it with {{dead link}}.
Client Error (Red) | 5 | The server has confirmed the link as dead. | Ensure the link is correct and doesn't contain stray bits of wiki markup. If possible, use archiveurl with an archived copy from the Internet Archive; otherwise tag it with {{dead link}}.
Server Error or Connection Issue (Blue) | 3 | HTTP 500-series server error or connection issue. | For a server error, contact the webmaster to fix the problem. For a connection issue, check whether the Whois record is still valid.
Bad link (Purple) | 6 | Spam link or Google Cache link. | Parking links should be removed. Google Cache links should be converted back to the regular link or to archiveurl.
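
The table can be read as a mapping from HTTP results to a rank. The Python sketch below is only an illustration under stated assumptions, not the tool's heuristics: the names `Rank` and `classify` are hypothetical, the Warn and Bad link cases are passed in as flags because they cannot be derived from status codes, a 4xx reached only after a redirect is treated as "heuristically determined", and the highest applicable rank is assumed to win.

```python
from enum import IntEnum

REDIRECT_CODES = {301, 302, 303, 307, 308}


class Rank(IntEnum):
    WORKING = 0       # White
    MESSAGE = 1       # Green
    WARN = 2          # Yellow
    SERVER_ERROR = 3  # Blue
    HEURISTIC = 4     # Orange
    CLIENT_ERROR = 5  # Red
    BAD_LINK = 6      # Purple


def classify(statuses, final_path="", bad_link=False, warn=False):
    """Return the highest applicable rank for one link.

    statuses: HTTP status codes in the order received; None stands for
    a connection failure. final_path: URL path of the last response.
    """
    if not statuses:
        raise ValueError("statuses must contain at least one entry")
    ranks = [Rank.WORKING]
    final = statuses[-1]
    redirected = any(code in REDIRECT_CODES for code in statuses[:-1])
    if redirected:
        ranks.append(Rank.MESSAGE)
    if warn:
        ranks.append(Rank.WARN)
    if final is None or 500 <= final <= 599:
        ranks.append(Rank.SERVER_ERROR)
    elif 400 <= final <= 499:
        # Assumption: a 4xx behind a redirect is a guess, a direct 4xx is confirmed
        ranks.append(Rank.HEURISTIC if redirected else Rank.CLIENT_ERROR)
    elif redirected and final_path == "/":
        # Redirect to the site root often means the page is gone
        ranks.append(Rank.HEURISTIC)
    if bad_link:
        ranks.append(Rank.BAD_LINK)
    return max(ranks)
```

With these assumptions, `classify([302, 200], final_path="/")` gives `Rank.HEURISTIC`, matching the second row of the interface example, while `classify([404])` gives `Rank.CLIENT_ERROR`.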

Repair

Once the page has fully loaded, select an article to work on. Click on the link to make sure the tool has correctly identified the problem (errors can be reported on the talk page). If the link is incorrect, you can try a Google search to locate it again, right-click and copy the URL, and paste it into the prompt created by the "Input correct URL" or "Input archive URL" option. The color of the box on the left changes to show the type of replacement that will be performed on the URL. When you're finished, click "Save changes" and the tool will merge your changes and present a preview or a diff before letting you save.

Redirects

There are principally two types of redirects used:[note 1] HTTP 301 (permanent redirect) and HTTP 302 (temporary redirect). For the former, it is recommended that the linking site update the URL to use the new address. For the latter, updating is optional and should be reviewed by a human operator.

Some links are access redirects, which let the reader avoid having to log into a system. These can be regarded as permalinks. Finally, there are redirects that point to fake or soft 404 pages. Do not blindly change these links![clarification needed]
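
The distinction above can be sketched in Python as a function that turns a followed redirect chain into advice for a human. This is a hedged illustration only: `redirect_advice` and its return strings are hypothetical, and treating a redirect from a deep page to the site root as a possible soft 404 is an assumption based on the description in this section.

```python
from urllib.parse import urlsplit


def redirect_advice(original_url, hops):
    """Suggest what to do with a redirected link.

    hops: list of (status_code, target_url) tuples in the order followed.
    """
    if not hops:
        return "no redirect"
    final_url = hops[-1][1]
    original_path = urlsplit(original_url).path or "/"
    final_path = urlsplit(final_url).path or "/"
    if final_path == "/" and original_path != "/":
        # A deep link that lands on the home page is suspicious
        return "possible soft 404: check by hand"
    if all(status == 301 for status, _ in hops):
        return "permanent: consider updating the URL"
    return "temporary: leave for human review"
```

The function never rewrites anything itself; in every case the final decision stays with the editor.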

Do not "fix" redirects

  • Changing the URL removes access to the archive history held by WebCite and the Wayback Machine under the old URL.
  • WP:NOTBROKEN reasons that the cost of an edit far exceeds the value of fixing a MediaWiki redirect. Much the same can be said of redirects on external links.

Archives

The Wayback Machine is a valuable tool for dead link repair. The simplest way to get the list of links from archive.org is to click on the row. You can also load the results manually and paste them in using the "Use archive URL" option. The software will attempt to insert the URL using the archiveurl parameter of {{cite web}}.
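
As a rough illustration of that last step, the Python sketch below builds a Wayback Machine URL from a snapshot timestamp and appends it to a {{cite web}} call. It is not the tool's code: the function names are hypothetical, the `archivedate` parameter is assumed to accompany `archiveurl`, and the template handling is deliberately naive (it only looks at the closing braces).

```python
import re


def wayback_url(url, timestamp):
    """Build a Wayback Machine snapshot URL.

    timestamp: 14 digits in YYYYMMDDhhmmss form.
    """
    if not re.fullmatch(r"\d{14}", timestamp):
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"https://web.archive.org/web/{timestamp}/{url}"


def add_archive_params(cite, archive_url, archive_date):
    """Append archiveurl/archivedate just before the closing braces."""
    cite = cite.rstrip()
    if not cite.endswith("}}"):
        raise ValueError("expected a template call ending in }}")
    if "archiveurl" in cite:
        return cite  # already archived, leave untouched
    body = cite[:-2].rstrip()
    return f"{body} |archiveurl={archive_url} |archivedate={archive_date}}}}}"
```

A citation that already carries an `archiveurl` is returned unchanged, so running the helper twice does not duplicate the parameter.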

Tips

  • Most non-news links can be found again by doing a search with the title of the link. This is the default setup for searching.
  • Links can be taken from the Google results by right-clicking, selecting "Copy Link Location", and inputting the URL through the drop down.
  • Always check the link by clicking on it (not the row), as some websites do not like how tools send requests (false positive) or the tool was not smart enough to handle the site's incorrect error handling (false negative).
  • Non-HTML documents can sometimes be found by searching for their file name.
  • If Google turns up the same link, leave it be: it has recently or temporarily become dead, and you will not find a replacement until Google's index is updated.
  • You may wish to email the webmaster asking them to use redirection to keep the old links working.

Internal workings

The tool downloads the wikitext using the edit page. It checks that the page exists and is not a redirect. Then it processes the markup: escaping certain comments so they are visible, removing nowiki'ed parts, expanding link templates, numbering bracketed links, adding reference numbers, and marking links tagged with {{dead link}}. Because templates are not actually expanded, some do not work as intended, most notably external link templates. A possible remedy is to use a better parser such as mwlib from Collection. The parsed page can be seen by appending &source=yes&debug=yes to the end of the URL.
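
A few of these steps (dropping nowiki'ed parts, numbering bracketed links, and noting a following {{dead link}} tag) can be sketched with regular expressions. The Python below is a simplified illustration under assumptions, not the tool's parser: `extract_links` and its result keys are hypothetical, and comments, references, and templates are ignored entirely.

```python
import re

NOWIKI_RE = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
# [http://example.com/page optional label]
BRACKETED_LINK_RE = re.compile(r"\[(https?://[^\s\]]+)(?:\s+([^\]]*))?\]")
DEAD_TAG_RE = re.compile(r"\s*\{\{\s*dead link", re.IGNORECASE)


def extract_links(wikitext):
    """Return numbered bracketed external links found outside <nowiki>."""
    text = NOWIKI_RE.sub("", wikitext)
    links = []
    for number, match in enumerate(BRACKETED_LINK_RE.finditer(text), start=1):
        # Look just past the closing bracket for a {{dead link}} tag
        tail = text[match.end():match.end() + 40]
        links.append({
            "number": number,
            "url": match.group(1),
            "label": match.group(2) or "",
            "tagged_dead": DEAD_TAG_RE.match(tail) is not None,
        })
    return links
```

Because it works on raw wikitext, this sketch shares the limitation described above: links that only appear after a template is expanded are never seen.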

Limitations

  • CNN.com has blocked the tool in the past, so this domain is now disabled to avoid waiting for connection timeouts.
  • External links transcluded from templates are excluded. This is on purpose, as the tool wouldn't be able to modify them when saving.

Linking

It is preferable to link using the tools: interwiki prefix. Change the link as follows:

[http://toolserver.org/~dispenser/view/Checklinks checklinks]
               [[tools:~dispenser/view/Checklinks|checklinks]]

Linking to a specific page (swap ?page= for /):

[http://toolserver.org/~dispenser/cgi-bin/webchecklinks.py?page=Edip_Yuksel Edip Yuksel links]
               [[tools:~dispenser/cgi-bin/webchecklinks.py/Edip_Yuksel     |Edip Yuksel links]]
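
The same conversion can be expressed as a small Python helper. This is a sketch for illustration only; the function name `to_interwiki` is hypothetical, and it handles just the two URL shapes shown above (a plain path, and a `?page=` query that is swapped for `/`).

```python
from urllib.parse import urlsplit, parse_qs


def to_interwiki(url, label):
    """Rewrite a toolserver.org URL as a tools: interwiki link."""
    parts = urlsplit(url)
    if parts.netloc != "toolserver.org":
        raise ValueError("not a toolserver.org URL: " + url)
    path = parts.path.lstrip("/")
    page = parse_qs(parts.query).get("page")
    if page:
        # Swap ?page=Title for /Title
        path = f"{path}/{page[0]}"
    return f"[[tools:{path}|{label}]]"
```

Calling `to_interwiki("http://toolserver.org/~dispenser/view/Checklinks", "checklinks")` yields `[[tools:~dispenser/view/Checklinks|checklinks]]`.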

Praise

The da Vinci Barnstar
For the work you do on the link checker tool, which makes FAC so much easier. Thank you. Ealdgyth - Talk 14:53, 18 May 2008 (UTC)

Documentation TODO

  • Add information for websites on how to opt out of scanning
  • Break things up so the page can be read non-linearly (i.e. use pictures, bullets)
  • Explain why detection isn't 100% accurate. Give examples of websites that return 404 for content, others that appear dead until the disks on the server finish spinning up, those that return 200 on error pages, etc.
  • Users don't seem to understand that they can make edits WITH the tool, or search the Internet Archive Wayback Machine and WebCite (archive too).

Notes

  1. ^ An HTTP redirect is not the same as a redirect used on Wikipedia.

See also

  • /config: a template that sets up periodic link checking for a project

External links