User:GreenC/software/linkrot
How to fix link rot - on Wikipedia and elsewhere.
The 3 results
Given a URL, there are only 3 basic results:
- Convert to an archive URL: http://example.com --> https://web.archive.org/web/20240601/http://example.com
- Move to a new URL: http://example.com --> http://example-new.com
- Do nothing; leave it alone: http://example.com
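The three results can be sketched as simple transformations. This is a minimal illustration, not real code from the tool: the archive snapshot timestamp and the example-new.com rewrite are placeholders taken from the examples on this page.

```python
# The three possible results for a URL, as plain string transformations.
# The snapshot timestamp and the domain rewrite are illustrative only.

ARCHIVE_PREFIX = "https://web.archive.org/web/20240601/"

def to_archive(url):
    """Result 1: wrap the original URL in a Wayback Machine snapshot URL."""
    return ARCHIVE_PREFIX + url

def to_new_url(url):
    """Result 2: move to a new URL (here, a hypothetical domain rename)."""
    return url.replace("example.com", "example-new.com")

def leave_alone(url):
    """Result 3: do nothing."""
    return url
```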
The 3 factors
When deciding which of the 3 results applies, there are 3 factors:
- Redirects - A redirect is when a URL redirects to a different URL.
- Soft-404s - A soft-404 is any live page whose content differs from the desired content; typically the URL redirects to a home page.
- Soft-redirects - A soft-redirect is when the page is live at a different URL, but there is no active redirect to the new URL.
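Two of the three factors leave traces in the HTTP response itself, which can be sketched as a small classifier. This is an assumption-laden sketch: `status` and `location` stand in for what a real HTTP check would report, and the home-page test for soft-404s is a heuristic, not a rule.

```python
# Sketch: flag redirects and probable soft-404s from a checked response.
# `location` is the redirect target (or None); `homepage` is an assumed
# soft-404 landing page. Soft-redirects cannot be detected this way,
# because by definition there is no active redirect to follow.

def classify(url, status, location, homepage="https://example.com"):
    """Return 'redirect', 'soft-404', or None for a checked URL."""
    if location and location != url:
        # A redirect to the bare home page is a probable soft-404;
        # any other redirect is an ordinary (hard) redirect.
        if location.rstrip("/") == homepage.rstrip("/"):
            return "soft-404"
        return "redirect"
    return None  # no redirect observed; soft-redirects leave no trace
```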
To properly determine which of the 3 results to choose, the 3 factors need to be known ahead of time. This foreknowledge may come from other editors who inform you that a URL has moved, or through discovery: looking at logs to see where URLs redirect to and interpreting that information. It is a process of learning the information, codifying it, and uploading the results.
Process
Process to decide the 3 results
- Codify any pre-known soft-redirects. These would be hard-coded rules based on foreknowledge. Thus, transform http://example.com to http://example-new.com. We'll call the result the "newurl".
- Check newurl for redirects. We'll call the redirect target the newloc URL, i.e. the "new location" URL.
- Make a two-column table: newurl <tab> newloc
- Analyze the table looking for repeating instances of the same newloc in the second column. These indicate probable soft-404s.
- Add new rules (code) to account for the soft-404s.
- Re-process the links with the soft-404 rules in place.
- Check every URL and redirect URL for status 200 or 404.
- If 404, then add an archive URL (result #1).
- If 200, return the newurl (result #2 or result #3, depending on the value of newurl).
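The process above can be sketched end-to-end in Python. This is a sketch under stated assumptions, not the actual tool: `networkcheck` is injected as a parameter so the logic runs offline, and the archive prefix, the example-new.com rewrite, and the soft-404 home page are all illustrative values from this page's examples.

```python
# Sketch of the full decision procedure. `networkcheck(url)` is assumed to
# return (status, redirect_target_or_None); here a canned stub stands in
# for real HTTP requests.

ARCHIVE = "https://web.archive.org/web/20240601/"
SOFT404_TARGET = "https://example.com"

def decide(origurl, networkcheck):
    """Return an archive URL (result 1) or a live URL (result 2 or 3)."""
    newurl = origurl.replace("example.com", "example-new.com")  # Step 1: known soft-redirect
    status, newloc = networkcheck(newurl)                       # Step 2: check for redirects
    if newloc:
        if newloc == SOFT404_TARGET:                            # soft-404 rule (Step 5)
            return ARCHIVE + origurl                            # result 1: archive the original
        newurl = newloc                                         # follow the hard redirect
        status, newloc = networkcheck(newurl)
    if status == 200:
        return newurl            # result 2 or 3, depending on whether newurl changed
    return ARCHIVE + origurl     # result 1: dead link, archive the original

# A canned networkcheck standing in for real HTTP requests:
def stub(url):
    responses = {
        "http://example-new.com/page1.htm": (200, "https://example.com"),
        "http://example-new.com/page3.htm": (200, "https://example-new.com/page3.htm"),
        "https://example-new.com/page3.htm": (200, None),
    }
    return responses.get(url, (404, None))
```

With the stub in place, page1 comes back as an archive URL (its redirect hits the soft-404 home page), while page3 comes back as the moved https URL.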
Example code
The following pseudo-code demonstrates the steps:
origurl = "http://example.com"
newurl = sub("example.com", "example-new.com", origurl)   # Step 1 - codify known soft-redirects
(status, newloc) = networkcheck(newurl)                   # Step 2 - check newurl for redirects; status = 200, 404, etc.
if newloc then
    print newurl "\t" newloc > "table.txt"                # Step 3 - make a two-column table
endif
At this point we follow Step #4 and look at the table, which might look something like this:
http://example.com/page1.htm    https://example.com
http://example.com/page2.htm    https://example.com
http://example.com/page3.htm    https://example.com/page3.htm
http://example.com/page4.htm    https://example.com
http://example.com/page5.htm    https://example.com/page5.htm
Here we see page1, page2, and page4 redirect to the home page, while the others redirect to the same path under "https". So we have learned two new rules:
- All URLs in this domain have a soft-redirect to https.
- Any URL that redirects to http://example.com is a soft-404.
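The Step-4 analysis that surfaces rules like these can be sketched with a frequency count over the table's second column: targets that repeat across many different source URLs are probable soft-404 landing pages. The rows below just mirror the example table above.

```python
# Sketch: count repeating newloc values in the two-column table.
# A target that many different URLs redirect to is a probable
# soft-404 landing page (here, the bare home page).

from collections import Counter

rows = [
    ("http://example.com/page1.htm", "https://example.com"),
    ("http://example.com/page2.htm", "https://example.com"),
    ("http://example.com/page3.htm", "https://example.com/page3.htm"),
    ("http://example.com/page4.htm", "https://example.com"),
    ("http://example.com/page5.htm", "https://example.com/page5.htm"),
]

counts = Counter(newloc for _, newloc in rows)
probable_soft404s = [loc for loc, n in counts.items() if n > 1]
# probable_soft404s == ["https://example.com"]
```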
So we modify the code as follows:
origurl = "http://example.com"
newurl = sub("example.com", "example-new.com", origurl)   # Step 1 - known soft-redirect
newurl = sub("http:", "https:", newurl)                   # Step 1 - known soft-redirect
(status, newloc) = networkcheck(newurl)                   # Step 2 - check newurl for redirects; status = 200, 404, etc.
if newloc then
    if newloc == "https://example.com" then               # Step 5 - soft-404
        return "404"
    endif
    newurl = newloc
    (status, newloc) = networkcheck(newurl)
endif
if status == 200 then
    return newurl
else
    return "404"
endif
Thus the above code will return:
http://example.com/page1.htm --> https://web.archive.org/web/20240601/http://example.com/page1.htm
http://example.com/page2.htm --> https://web.archive.org/web/20240601/http://example.com/page2.htm
http://example.com/page3.htm --> https://example-new.com/page3.htm
http://example.com/page4.htm --> https://web.archive.org/web/20240601/http://example.com/page4.htm
http://example.com/page5.htm --> https://example-new.com/page5.htm