User:GreenC/WaybackMedic
This is an old version. The latest version is WaybackMedic 2.5.
WaybackMedic (WM) is a bot that fixes problems with links to the Internet Archive. Mostly it removes or replaces Wayback links that don't work (fix #4 below).
- WM fixes
Fix number | Function name | Example edit | Description | Notes |
---|---|---|---|---|
1 | fixthespuriousone | Example | Remove spurious \|1= in cite templates. | |
2 | fixmissingprotocol | Example | Applied only when another fix is also made: (1) Add https if missing from the archive.org URL. (2) Add the "web" subdomain if missing (archive.org/web/ --> web.archive.org/web/). (3) Add the "/web" path if missing (web.archive.org/2016/ --> web.archive.org/web/2016/). (4) Remove ":80" (e.g. https://web.archive.org/web/2016/http://example.com:80/). Port 80 is added by the API and is not needed. Ports other than 80 are retained. | HTTPS per RFC |
3 | fixemptyarchive | Example | If \|archiveurl= is empty, remove \|archiveurl= and \|archivedate= and add {{dead link}}. If \|archiveurl= is empty but the \|url= is working, leave it alone. | |
4 | fixbadstatus | Example | Check all Wayback Machine URLs for response code errors (anything but 200s). If an error code is returned, look for a better URL via the Wayback API, first using the accessdate, then using the earliest date available. If none is found, remove \|archiveurl= and \|archivedate= and add {{dead link}}. | |
5 | fixtrailingchar | Example | In some cases the URL erroneously ends with a trailing ".", ",", ":" or "-". | |
6 | fixemptywayback | Example | The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example). | |
7 | fixencodedurl | Example | The URL was incorrectly encoded. Fully decode the URL and re-encode it. | |
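The URL clean-ups in fixes 2 and 5 can be sketched in a few lines of Python. This is only an illustration of the rules listed in the table, not WM's actual code. The function names `normalize_wayback_url` and `strip_trailing_char` are hypothetical, and the sketch assumes that "https if missing" means the URL has no protocol at all.

```python
import re

def normalize_wayback_url(url: str) -> str:
    """Apply the four fix-2 normalizations to an archive.org URL (sketch)."""
    # 1. Add https when no protocol is present (also covers "//archive.org/...").
    #    Assumption: an existing http:// or https:// prefix is left untouched.
    if not re.match(r"^https?://", url):
        url = "https://" + url.lstrip("/")
    # 2. archive.org/web/ --> web.archive.org/web/
    url = re.sub(r"^(https?://)archive\.org/web/", r"\1web.archive.org/web/", url)
    # 3. web.archive.org/2016/ --> web.archive.org/web/2016/
    url = re.sub(r"^(https?://web\.archive\.org)/(\d{4,14}/)", r"\1/web/\2", url)
    # 4. Drop ":80" from the archived URL's host; other ports are kept.
    url = re.sub(
        r"^(https?://web\.archive\.org/web/\d+[a-z_]*/(?:https?://)?[^/:]+):80(?=/|$)",
        r"\1",
        url,
    )
    return url

def strip_trailing_char(url: str) -> str:
    """Fix 5 (sketch): remove a trailing run of '.', ',', ':' or '-'."""
    # Assumption: the characters are stripped unconditionally; the table says
    # only that "in some cases" they are erroneous.
    return url.rstrip(".,:-")
```

For example, `normalize_wayback_url("archive.org/web/2016/http://example.com:80/")` yields `https://web.archive.org/web/2016/http://example.com/`, while a URL whose archived host uses port 8080 keeps its port.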
- WM examines
- {{wayback}} templates inside ref pairs.
- Citation templates inside and outside ref pairs.
- Bare wayback URLs outside templates. If these return an error (404 etc.), they are replaced with the regular URL. WM is currently unable to add {{dead link}} in this case.
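Replacing a dead bare Wayback URL with the regular URL amounts to splitting off the archive prefix and timestamp. The following is a minimal sketch of that idea, not WM's implementation. The regular expression, the function names and the `is_dead` callback are all hypothetical, and the sketch does not check whether a URL sits inside or outside a template.

```python
import re

# Wayback URL: archive prefix, timestamp (optionally with a modifier such as
# "id_"), then the original URL up to whitespace or wiki markup.
WAYBACK_RE = re.compile(
    r"https?://(?:web\.)?archive\.org/web/(\d{1,14})[a-z_]*/([^\s\]|}<]+)"
)

def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a Wayback URL."""
    m = WAYBACK_RE.fullmatch(url)
    return m.groups() if m else None

def replace_dead_wayback_links(text, is_dead):
    """Swap each Wayback URL in text for its original URL when is_dead(url) is true."""
    def repl(m):
        return m.group(2) if is_dead(m.group(0)) else m.group(0)
    return WAYBACK_RE.sub(repl, text)
```

Passing the status check in as `is_dead` keeps the text rewriting separate from any network access, so the rewriting can be tried on sample wikitext without contacting the archive.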
- WM design
- Multiple HTTP checks at the application layer if Wayback reports an error, to account for brief outages or intermittent responses.
- In addition, timeouts and retries are built into the web transfer agent settings (wget).
- Multiple checks of the Wayback API using multiple dates to ensure a page really is unavailable.
- Re-checks the API results by looking at the header to ensure it really is a good page.
- If IA returns a 404 "Bummer. The machine that serves this file is down." page, treat it as a code 200 and leave the link alone.
- If no Wayback available, checks Memento for alternative archives such as Library of Congress, WebCite and a few dozen others.
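The defensive checks above can be illustrated with a short Python sketch. It is an assumption-laden outline rather than WM's code: the function names, the `fetch` callback (which returns a status code and page body) and the retry counts are hypothetical. The "Bummer" text is taken from the list above.

```python
import time

BUMMER_TEXT = "The machine that serves this file is down"

def effective_status(status, body):
    """Treat IA's 404 "Bummer" outage page as a 200 so the link is left alone."""
    if status == 404 and BUMMER_TEXT in body:
        return 200
    return status

def check_with_retries(url, fetch, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times; return the first 2xx status,
    or the last status seen if every attempt fails."""
    status = 0
    for attempt in range(attempts):
        status, body = fetch(url)
        status = effective_status(status, body)
        if 200 <= status < 300:
            return status
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before retrying a possibly brief outage
    return status

def find_working_snapshot(candidates, fetch, attempts=3):
    """Try candidate snapshot URLs in order (for example, the one nearest the
    access date first, then the earliest) and return the first that works."""
    for url in candidates:
        if 200 <= check_with_retries(url, fetch, attempts) < 300:
            return url
    return None
```

Only when `find_working_snapshot` returns `None` would a caller move on to the last resort described above, checking Memento for other archives.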
Statistics
- August 2015 to June 6, 2016
WaybackMedic checked ~140k articles edited by Cyberbot II from August 2015 to June 6, 2016. It found 374596 wayback links (including duplicates), of which 29171 were dead, spread across 17978 articles. It was able to fix 8785 by finding a new snapshot date and 661 by finding an alternative archive service through Mementoweb.org. The remaining 19602 were deleted from Wikipedia (blocked by robots.txt, missing page, or a link that was never good to begin with). Other fixes and problems were logged and corrected.
Type | Number | Description |
---|---|---|
Bummer | 215 | Wayback links that return "Bummer page not found" |
API mismatch | 15596 | Wayback API returned fewer records than sent |
Bogusapi | 4360 | Wayback API-returned links that don't match real status code |
JSON mismatch | 7990 | Wayback API returned different size JSON |
Discovered | 17978 | Number of articles edited by WaybackMedic |
Log 404 | 29171 | Dead wayback links |
Log emptyarch | 768 | Empty archiveurl arguments |
Log emptyway | 486 | Ref has an empty {{wayback}} |
Log encode | 82 | URL misencoded |
Log spurious 1 | 1294 | Spurious \|1= parameter |
Log trail | 104 | URL has a trailing bad character |
Log dead URL | 244 | \|url= is dead even though \|deadurl= no, \|archiveurl= dead and missing {{dead}} |
New alt archive | 661 | Replaced with archive URL found at Mementoweb.org |
New IA date | 8785 | Changed snapshot date |
Redirects | 512 | Page was a redirect |
Zombie links | 145 | Links needing removal by hand |
Wayback RM | 19602 | Wayback link deleted |
Wayback All | 374596 | Wayback links total found |
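The proportions behind these figures can be worked out with a few lines of arithmetic. The numbers below are copied from the table. The variable names and the `pct` helper are only illustrative.

```python
total_links = 374596   # Wayback All
dead_links = 29171     # Log 404
new_date = 8785        # New IA date
alt_archive = 661      # New alt archive
removed = 19602        # Wayback RM

def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

dead_share = pct(dead_links, total_links)      # dead links among all links found
new_date_share = pct(new_date, dead_links)     # rescued with a new snapshot date
alt_share = pct(alt_archive, dead_links)       # rescued via another archive
removed_share = pct(removed, dead_links)       # deleted from Wikipedia

accounted = new_date + alt_archive + removed
unaccounted = dead_links - accounted

print(dead_share, new_date_share, alt_share, removed_share, unaccounted)
```

Roughly 8% of the links found were dead, about a third of those were rescued, and about two thirds were removed. In the figures as given, the three outcomes sum to 29048, which is 123 fewer than the 29171 dead links. The table does not say how that remainder was handled.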
Links
- Bot Request for Approval
- WaybackMedic on GitHub
- DISCLAIMER
Please report problems.