User:GreenC/WaybackMedic


WaybackMedic (WM) is a bot that fixes problems with links to the Internet Archive. Most of its work is replacing or removing Wayback Machine links that do not work (fix #4 below).


WM fixes
WaybackMedic Fixes
Fix number Function name Example edit Description Notes
1 fixthespuriousone Example Remove spurious |1= in cite templates.
2 fixmissingprotocol Example These are done only when another fix is done:
1. Add https if missing from the archive.org URL.
2. Add the "web" subdomain if missing (archive.org/web/ --> web.archive.org/web/)
3. Add "/web" path (web.archive.org/2016/ --> web.archive.org/web/2016/)
4. Remove ":80" (e.g. https://web.archive.org/web/2016/http://example.com:80/). Port 80 is added by the API and is not needed. Ports other than 80 are retained.
HTTPS per RFC
3 fixemptyarchive Example If |archiveurl= is empty and the |url= is dead, remove |archiveurl= and |archivedate= and add {{dead link}}. If |archiveurl= is empty but the |url= is working, leave the citation alone.
4 fixbadstatus Example Check all Wayback Machine URLs for response code errors (anything but the 200s). If a URL returns an error code, try for a better URL via the Wayback API, first using the accessdate, then using the earliest date available. If none is found, remove |archiveurl= and |archivedate= and add {{dead link}}.
5 fixtrailingchar Example In some cases the URL erroneously ends with a trailing "." or "," or ":" or "-".
6 fixemptywayback Example The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example).
7 fixencodedurl Example The URL was incorrectly encoded. Fully decode the URL and re-encode it.
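
The four normalizations of fix 2 can be sketched in Python as follows. This is a minimal illustration, not WaybackMedic's actual code: the function name fix_missing_protocol, the regular expressions, and the decision to leave an existing http:// scheme untouched are all assumptions made for the example.

```python
import re

# Hypothetical pattern: ":80" directly after the host of the archived URL,
# e.g. /web/2016/http://example.com:80/ . Ports such as :8080 do not match.
PORT_80_RE = re.compile(
    r"^(/web/\d+[a-z_]*/(?:https?://)?[^/:]+):80(?=/|$)", re.IGNORECASE
)


def fix_missing_protocol(url: str) -> str:
    """Apply the four fix-2 steps to an archive.org URL (illustrative only)."""
    if "archive.org" not in url.lower():
        return url  # not an Internet Archive link, leave it alone

    # Step 1: add https if the scheme is missing.
    if not re.match(r"https?://", url, re.IGNORECASE):
        url = "https://" + url.lstrip("/")

    parts = re.match(r"(https?://)([^/]+)(/.*)?$", url, re.IGNORECASE)
    if parts is None:
        return url
    scheme, host, path = parts.group(1), parts.group(2), parts.group(3) or "/"

    # Step 2: archive.org/web/ --> web.archive.org/web/
    if host.lower() == "archive.org" and path.startswith("/web/"):
        host = "web.archive.org"

    if host.lower() == "web.archive.org":
        # Step 3: web.archive.org/2016/ --> web.archive.org/web/2016/
        if re.match(r"/\d", path):
            path = "/web" + path
        # Step 4: drop ":80" from the archived URL, keep any other port.
        path = PORT_80_RE.sub(r"\1", path)

    return scheme + host + path


print(fix_missing_protocol("archive.org/web/2016/http://example.com:80/"))
# prints https://web.archive.org/web/2016/http://example.com/
```

The sketch applies all four steps unconditionally. According to the table, the bot makes these changes only when another fix is also being applied to the same link, so a real caller would need that extra condition.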
WM examines
  • {{wayback}} templates inside ref pairs.
  • Citation templates inside and outside ref pairs.
  • Bare wayback URLs outside templates. If these return a 404 or a similar error, WM replaces them with the regular URL. WM is currently unable to add {{dead link}} in this case.
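
Replacing a dead bare Wayback URL with the regular URL amounts to stripping the archive prefix and timestamp. The sketch below shows one way to do that in Python. It is an illustration under assumptions: the pattern WAYBACK_RE, both function names, and the is_dead callback are hypothetical, and the real bot's parsing rules may differ.

```python
import re

# Hypothetical pattern for a Wayback URL: group 1 is the snapshot timestamp
# (optionally followed by a modifier such as "id_"), group 2 is the original
# URL. Characters that usually end a URL in wikitext are excluded.
WAYBACK_RE = re.compile(
    r"https?://web\.archive\.org/web/(\d{1,14})[a-z_]*/([^\s\]<>|}]+)",
    re.IGNORECASE,
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a Wayback URL."""
    match = WAYBACK_RE.fullmatch(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


def replace_dead_bare_links(text, is_dead):
    """Swap each Wayback URL in text for its original URL when is_dead(url) is true."""
    def swap(match):
        return match.group(2) if is_dead(match.group(0)) else match.group(0)
    return WAYBACK_RE.sub(swap, text)
```

Here is_dead stands in for whatever status check the caller uses. Passing it in as a function keeps the text rewriting separate from any network access.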
WM design
  • Multiple HTTP checks at the application layer if Wayback reports an error, to account for brief outages or intermittent responses.
  • In addition, timeouts and retries are built into the settings of the web transfer agent (wget).
  • Multiple checks of the Wayback API using multiple dates to ensure a page really is unavailable.
  • Re-checks the API results by looking at the header to ensure it really is a good page.
  • If IA returns a 404 with the page "Bummer. The machine that serves this file is down.", treats it as a code 200 and leaves the link alone.
  • If no Wayback snapshot is available, checks Memento for alternative archives such as the Library of Congress, WebCite and a few dozen others.
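
The repeated checks and the special handling of the "Bummer" page can be sketched as below. This is a hedged illustration rather than the bot's own logic: the names is_link_ok and check_with_retries, the marker string match, the retry count, and the injected fetch callable (which returns a status code and page body) are all assumptions.

```python
import time

# Assumed marker text, taken from the "Bummer" message quoted in the list above.
BUMMER_MARKER = "The machine that serves this file is down"


def is_link_ok(status, body):
    """Treat 2xx as good, and the 404 'Bummer' page as a temporary server fault."""
    if status == 404 and BUMMER_MARKER in body:
        return True  # storage node is down; leave the link alone
    return 200 <= status < 300


def check_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times before declaring the link dead."""
    for attempt in range(attempts):
        try:
            status, body = fetch(url)
        except OSError:
            status, body = 0, ""  # network failure counts as a failed attempt
        if is_link_ok(status, body):
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)  # brief pause to ride out short outages
    return False
```

A link is reported dead only after every attempt fails, which mirrors the goal of not removing archive links because of a brief outage.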

Statistics

August 2015 to June 6, 2016

WaybackMedic checked ~140k articles edited by Cyberbot II from August 2015 to June 6, 2016. It found 374596 wayback links (including duplicates), of which 29171 were dead, spread across 17978 articles. It was able to fix 8785 by finding a new snapshot date and 661 by finding an alternative archive service through Mementoweb.org. The remaining 19602 were deleted from Wikipedia (robots.txt, a missing page, or a link that was never good to begin with). Other fixes and problems were logged and corrected.

WaybackMedic Stats
Type             Number  Description
Bummer           215     Wayback links that return "Bummer page not found"
API mismatch     15596   Wayback API returned fewer records than sent
Bogusapi         4360    Wayback API-returned links that don't match real status code
JSON mismatch    7990    Wayback API returned different size JSON
Discovered       17978   Number of articles edited by WaybackMedic
Log 404          29171   Dead wayback links
Log emptyarch    768     Empty archiveurl arguments
Log emptyway     486     Ref has an empty {{wayback}}
Log encode       82      URL misencoded
Log spurious 1   1294    Spurious |1= parameter
Log trail        104     URL has a trailing bad character
Log dead URL     244     |url= is dead even though |deadurl=no; the |archiveurl= is also dead and {{dead link}} is missing
New alt archive  661     Replaced with archive URL found at Mementoweb.org
New IA date      8785    Changed snapshot date
Redirects        512     Page was a redirect
Zombie links     145     Links needing removal by hand
Wayback RM       19602   Wayback link deleted
Wayback All      374596  Wayback links total found

Links

DISCLAIMER

Please report problems.