
Wikipedia:Link rot/URL change requests



This page is for requesting modifications to URLs, such as marking them dead or changing them to a new domain. Some bots are designed to fix link rot and can be notified here; these include InternetArchiveBot and WaybackMedic. This page can also be monitored by bot operators from other-language wikis, since URL changes are universally applicable.

Bot might convert links to httpS?

There are several thousand "http" links on WP to many different pages of my site (whose homepage is http://penelope.uchicago.edu/Thayer/E/home.html) which really should be httpS. The site is secure with valid certificates, etc. Is this something a bot can take care of quickly?

24.136.4.218 (talk) 19:20, 11 February 2021 (UTC)[reply]

In general, I think a bot could replace http with https for all webpages, after some checks. (The guidelines prefer https over http; see WP:External links#Specifying_protocols.) My naive idea is a bot that goes through http links and checks whether they are also valid with https. If they are, the bot can replace the http link with the https link. Apart from the question of whether there is a general problem with the idea, a few questions remain (a rough sketch of such a check follows after this comment):
  1. Should the bot replace all links or only major ones (official webpage, info boxes, ...)?
  2. Should the bot only check if https works or if http and https provide the same page?
I would be happy to hear what others think about the idea. Nuretok (talk) 11:43, 5 April 2021 (UTC)[reply]
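For illustration, a minimal sketch of the https check described above, in Python with the requests library; the function name is an invention for this example, not any existing bot's code:

```python
import requests

def https_equivalent(url, timeout=10):
    """Return the https:// version of an http:// URL if it appears to work,
    else None. A link is considered 'working' when the https request
    succeeds with a 2xx status after following redirects."""
    if not url.startswith("http://"):
        return None
    candidate = "https://" + url[len("http://"):]
    try:
        resp = requests.get(candidate, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return None  # TLS error, connection refused, etc.: keep http
    return candidate if resp.ok else None

# example from the request above
print(https_equivalent("http://penelope.uchicago.edu/Thayer/E/home.html"))
```

Answering question 2 above would need an extra step, e.g. comparing the http and https response bodies before replacing.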
One argument against this is that many websites implement an http -> https redirect. Thus if one accesses the link with http, one is redirected to https. In this case, it would not matter which protocol the link uses on WP; the user would always end up on https. Even the example cited above is redirected. -- Srihari Thalla (talk) 19:09, 8 April 2021 (UTC)[reply]
You are right, many websites forward http to https, but this still allows a man-in-the-middle attack if someone prevents the redirect. This is one of the reasons the Wikipedia guidelines recommend using https and browser plugins such as HTTPS Everywhere exist. Of course everyone is free to use HTTPS Everywhere, but providing good defaults (https in this case) is usually considered good practice. By the way, instead of checking each site individually, there is a list of servers that support https which the bot could consult when deciding whether to move from http to https. Nuretok (talk) 08:20, 17 April 2021 (UTC)[reply]

mcleans.ca

I found a few dozen broken links to mcleans.ca. Can these links be replaced with archive URLs? Jarble (talk) 19:11, 2 February 2021 (UTC)[reply]

Jarble: There are 517 that use "www2". A test of a couple suggests they might be made live again by removing the "2". Any that report 404 can be archived. Will check into it. -- GreenC 22:21, 2 February 2021 (UTC)[reply]

results: Converted 575 URLs in 496 articles. The few www2 remaining are archive URLs. -- GreenC 18:17, 3 February 2021 (UTC)[reply]

Replace Airdisaster.com links

Copy-pasted thread from WP:BOTREQ

The website airdisaster.com appears to be used in several articles about aviation accidents, but now links to a spam site/domain hoarder, which seems very undesirable for users. Can someone get the direct links removed and, where possible, linked to an archived page? In particular where it is linked as an external link; occurrences in references appear to be fixed already. Pieceofmetalwork (talk) 16:07, 9 January 2021 (UTC)[reply]

@Pieceofmetalwork: Are you suggesting adding {{webarchive}} like this edit? GoingBatty (talk) 18:46, 10 January 2021 (UTC)[reply]
Yes, that would be a good solution. Pieceofmetalwork (talk) 18:48, 10 January 2021 (UTC)[reply]
@Cyberpower678: Could these links be replaced by the Internet Archive Bot? Jarble (talk) 20:18, 22 January 2021 (UTC)[reply]
Suggest trying WP:URLREQ, Jarble. ProcrastinatingReader (talk) 16:04, 2 February 2021 (UTC)[reply]
Yes, this is a URLREQ case, since it should also toggle |url-status=unfit. -- GreenC 21:36, 3 February 2021 (UTC)[reply]

It's done. Example edits: [1][2][3][4], etc. -- GreenC 03:16, 4 February 2021 (UTC)[reply]

observer.com

I found many broken links to www.observer.com: some (but not all) of these links no longer lead to the articles that were originally cited. Jarble (talk) 21:04, 13 February 2021 (UTC)[reply]

Since this is a mix of live and dead, it is probably better to leave it for IABot, which should be able to detect the dead ones. -- GreenC 03:19, 14 February 2021 (UTC)[reply]
@GreenC: IABot won't detect them. I tried running IABot on this page, but the link is still incorrect. Jarble (talk) 21:35, 11 March 2021 (UTC)[reply]

IABot won't work. It's pretty complex. First impression: anything "https" is OK, and anything "http" without a hostname is also OK. That narrows it down to about a thousand possible trouble URLs. Of these, some work and some don't. Some are also redirecting to spam links needing |url-status=unfit. There are patterns, but also exceptions. I might need to make a dry run, log what it does, build rules that take the mistakes into account, then make a live run. It's hard to say up front what the rules should be. It will take some time to figure out; there are a lot of variables. -- GreenC 01:45, 12 March 2021 (UTC)[reply]
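As a sketch of the kind of triage described above (not the actual bot logic): classify each URL by following redirects, treating an off-domain landing page as a spam redirect needing |url-status=unfit and a 404 as dead. The heuristics here are assumptions:

```python
import requests
from urllib.parse import urlparse

def classify(url, timeout=10):
    """Rough triage for a possibly-dead observer.com link:
    'live', 'dead' (404 or unreachable), or 'unfit' (redirects off-domain,
    e.g. to a spam landing page)."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return "dead"
    final_host = urlparse(resp.url).hostname or ""
    if not final_host.endswith("observer.com"):
        return "unfit"  # candidate for |url-status=unfit
    if resp.status_code == 404:
        return "dead"
    return "live"
```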

Results

The rest were already archived or still working or now tagged with {{dead link}}. Once the soft404 redirects were identified it was not too difficult. If you see any problems let me know. @Jarble: -- GreenC 21:39, 13 March 2021 (UTC)[reply]

sfsite.com/~silverag

My website, formerly located at www.sfsite.com/~silverag, has moved to www.stevenhsilver.com. It is used as a citation on numerous Wikipedia pages. If a bot could go through and replace the string sfsite.com/~silverag with stevenhsilver.com, it would correct the broken links. Shsilver (talk) 12:57, 14 February 2021 (UTC)[reply]

Hi, the bot switched 108 URLs. There are 13 left that the bot could not determine. -- GreenC 17:54, 14 February 2021 (UTC)[reply]
Thanks. Some of those switched, others pointed to pages I decided not to upload to the new site. I appreciate your work and your bot's. Shsilver (talk) 19:19, 14 February 2021 (UTC)[reply]

Illinois Historic Preservation Agency

Hello, the Illinois Historic Preservation Agency recently took down their website because it was based on Adobe Flash, breaking lots of links to documentation. I just checked a random one, and it was in the Internet Archive, so I assume that the link-changing bots could archive a good number of them. Could someone have a bot collect all the URLs of the form http://gis.hpa.state.il.us/pdfs/XXXXXX.pdf and run all of them through the IA? "X" represents a numeral; some of these files may have five or fewer numerals (XXXX.pdf), or may have seven or more (XXXXXXXX.pdf), so please don't assume that they're all six digits.

Thanks! Nyttend (talk) 19:27, 16 February 2021 (UTC)[reply]
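A minimal sketch of the lookup such a bot would do, using the Wayback Machine's public availability API; the sample file name is hypothetical:

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def closest_snapshot(url):
    """Ask the Wayback Machine for the closest available snapshot of a URL.
    Returns the snapshot URL, or None if nothing is archived."""
    resp = requests.get(WAYBACK_API, params={"url": url}, timeout=15)
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# hypothetical six-digit IHPA document (digit counts vary, per the request)
print(closest_snapshot("http://gis.hpa.state.il.us/pdfs/200126.pdf"))
```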

Hi Nyttend, results are in: 1,151 articles edited, 1,035 archive URLs and 217 {{dead link}}s added. Let me know if you see any problems. PDFs are the easiest, as they either clearly work or not. -- GreenC 01:35, 17 February 2021 (UTC)[reply]

Thank you, GreenC. If you click any IHPA link (even my XXXXXX sample), you're taken to a page that says "A new version of HARGIS will be available in the coming weeks." (This was the case before I made this request; I asked because there's no guarantee that the new site will use the same linking structure for its PDFs.) Do you have a way of looking up where the 217 dead links are located? When I notice that they've put up the new version of the site, I may come back and ask for help getting the links working again, but only if you have a way of going through the ones that your bot handled, without un-archiving the 1035. Nyttend (talk) 12:19, 17 February 2021 (UTC)[reply]
In that case the 217 + 1,035 might be live again (there are logs). Ping me when ready and I will take a look. The bot can unwind archives, replace dead links with live ones, move URL schemes, retrieve new URLs from redirects, etc. -- GreenC 15:39, 17 February 2021 (UTC)[reply]

whitehouse.gov

A lot of whitehouse.gov links have died after the domain recently "changed owner". A rare occasion where many Wikipedians may be glad for sources dying. There is an archive at https://trumpwhitehouse.archives.gov. Example of old broken and new working url:

There is a slim chance/risk that some of the broken links will work again in about four years. Some whitehouse.gov links are working and should not be changed. Can a bot sort it out? PrimeHunter (talk) 13:09, 25 February 2021 (UTC)[reply]

Some older source links are archived at https://obamawhitehouse.archives.gov or https://georgewbush-whitehouse.archives.gov.
Obama example of broken and working link:
Bush example of broken and working link:
Some links work via redirects:
redirects to
https://www.archives.gov/presidential-libraries/archived-websites also mentions Clinton archives. The newest is https://clintonwhitehouse5.archives.gov/ from January 2001. I don't know whether we have broken links it could fix.
A bot could test every whitehouse.gov link to see whether it works now or at any of the archives. PrimeHunter (talk) 14:02, 25 February 2021 (UTC)[reply]
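A rough sketch of that test, assuming the path is unchanged across the *.archives.gov mirrors listed above; this is an illustration, not the actual bot code:

```python
import requests

# mirrors mentioned in this thread, newest administration first
MIRRORS = [
    "https://trumpwhitehouse.archives.gov",
    "https://obamawhitehouse.archives.gov",
    "https://georgewbush-whitehouse.archives.gov",
    "https://clintonwhitehouse5.archives.gov",
]

def find_working(url, timeout=10):
    """Try the live whitehouse.gov URL first, then the same path on each
    archived mirror. Returns the first URL answering 2xx, or None."""
    path = url.split("whitehouse.gov", 1)[1]
    candidates = [url] + [m + path for m in MIRRORS]
    for candidate in candidates:
        try:
            # HEAD is cheaper; some servers reject it, so treat errors as misses
            if requests.head(candidate, timeout=timeout,
                             allow_redirects=True).ok:
                return candidate
        except requests.RequestException:
            continue
    return None
```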
OK, based on your research, I agree it's worth exploring to see how well it works. Will take a look. -- GreenC 14:25, 25 February 2021 (UTC)[reply]
  • Results: modified 8,263 URLs in 5,060 articles. Changed metadata info such as |work=whitehouse.gov. Plus other general fixes by WaybackMedic. As a matter of curiosity: 67% were found by the scanning method described above, and the rest had working redirects in the header. Most of the working redirects were Obama; Trump had a high proportion of 404s and no redirects, perhaps poorly maintained and/or too soon after leaving office. Also, some pages (10%?) can't be archived by any web archive service; something in the page prevents web archiving by third parties, but regardless they still work at the National Archives. @PrimeHunter: -- GreenC 16:46, 3 March 2021 (UTC)[reply]
@GreenC: Great! Thanks a lot. Do you have a list of broken links which couldn't be fixed? I noticed one in [5]: https://www.whitehouse.gov/the-press-office/2013/05/20/president-obama-announces-sally-ride-recipient-presidential-medal-freedom. It redirects but the target doesn't work. Thanks for checking; the redirect didn't help. It turned out to be our own fault: the real link [6] didn't have the final m, which was added by a careless editor in [7], so there is no general fix we can learn from that. PrimeHunter (talk) 22:30, 3 March 2021 (UTC)[reply]
There were 30: Wikipedia:Link rot/cases/whitehouse.gov -- GreenC 22:55, 3 March 2021 (UTC)[reply]
@GreenC: Thanks. That's a nice low number. I have fixed many of them with guessing or Googling without finding a system. Some were clearly our own fault with url's that never would have worked. Should I remove the fixed ones from Wikipedia:Link rot/cases/whitehouse.gov? PrimeHunter (talk) 02:21, 4 March 2021 (UTC)[reply]
Yes, about 0.5% of the whitehouse URLs are explainable by local data entry or remote site errors; that's probably better than one might expect. It's a good thing to check for, and great you were able to fix some. Use the page any way you like; mark up or delete entries. -- GreenC 03:12, 4 March 2021 (UTC)[reply]

StarWars.com

Anything with http://www.starwars.com should be changed to https. Thanks. JediMasterMacaroni (Talk) 18:20, 25 February 2021 (UTC)[reply]

Forwarded to User_talk:Bender235#StarWars.com -- GreenC 19:04, 25 February 2021 (UTC)[reply]
Will do. --bender235 (talk) 19:33, 25 February 2021 (UTC)[reply]
Thanks. JediMasterMacaroni (Talk) 19:34, 25 February 2021 (UTC)[reply]

Replace atimes.com links

Please replace all instances of atimes.com and its subdomains with asiatimes.com. The old website has been replaced by an advertising site. ~ Ase1estecharge-paritytime 10:11, 28 February 2021 (UTC)[reply]

Also, if the corresponding page at the new domain is not found and not archived, and there is an archive with the old domain, then do not replace the URL; instead add the archive link and mark the URL status as unfit. Thanks. ~ Ase1estecharge-paritytime 10:26, 28 February 2021 (UTC)[reply]
Ok. It might take a couple of passes: first to move the domain where possible, and second to add the archives + unfit for the remainder. Still working on whitehouse.gov above, so it could be a few days at least. -- GreenC 15:46, 28 February 2021 (UTC)[reply]
Ok, thanks, I can wait. ~ Ase1estecharge-paritytime 17:42, 28 February 2021 (UTC)[reply]

Results:

  • 287 URLs changed from atimes.com to asiatimes.com
  • 1,995 URLs converted to archives including |url-status=unfit. Includes CS1|2, square and bare links
  • 3 URLs had no archives (in Peter Heehs, Thaksin Shinawatra, Iran–Saudi Arabia relations). Added {{dead link}}. Need manual attention.
  • 11 citations converted from [square link]{{webarchive}} to {{cite web}} with |url-status=unfit.
  • 1 URL in File: space
  • Domain status set to 'Blacklisted' in the IABot database.

@Aseleste: I think that is all, if you see anything else let me know. -- GreenC 04:23, 6 March 2021 (UTC)[reply]

Looks good, thanks! ~ Ase1estecharge-paritytime 04:28, 6 March 2021 (UTC)[reply]

www.geek.com

I found many broken links on this domain: is it possible to fix them automatically? Jarble (talk) 21:30, 11 March 2021 (UTC)[reply]

This is the same situation as observer.com: in the IABot database the domain is set to Whitelisted, thus the bot is not checking/fixing dead links. My bot can try; it's a lot easier than observer, as the numbers are small and it only requires checking for 404s. -- GreenC 01:51, 12 March 2021 (UTC)[reply]

unc.edu

Thread copied from WP:BOTREQ#Replace_dead_links

Please could someone replace ELs of the form

with

which produces

  • Rowlett, Russ. "Lighthouses of the Bahamas". The Lighthouse Directory. University of North Carolina at Chapel Hill.

Thanks — Martin (MSGJ · talk) 05:38, 19 March 2021 (UTC)[reply]

What sort of scale of edits are we talking (tens, hundreds, thousands)? Primefac (talk) 14:37, 19 March 2021 (UTC)[reply]
Special:LinkSearch says 1054 for "https://www.unc.edu/~rowlett/lighthouse" and 483 for the "http://" variant. DMacks (talk) 14:43, 19 March 2021 (UTC)[reply]
But spot-checking, it's a mix of {{cite web}}, plain links, and links with piped text, and with/without additional plain bibliographic notes. For example, 165 of the https:// form are in a "url=..." context. I think there are too many variations to do automatically. DMacks (talk) 15:06, 19 March 2021 (UTC)[reply]

MSGJ, the only type that can be converted is {{cite web}}; as noted by User:DMacks, it's too messy to determine the square and bare links due to the free-form text that might surround the URL, unless there is some discernible pattern. There are 334 articles that contain a preceding "url=". A couple of questions:

  • Do you know if the content at http://www.ibiblio.org/lighthouse/* is the same as https://www.unc.edu/~rowlett/* as originally cited? ie. what are the chances there has been content drift for these pages.
  • What would you do if the old cite has an |archiveurl= .. delete the archive or leave the cite alone?

-- GreenC 19:19, 19 March 2021 (UTC)[reply]

Thanks for looking into this GreenC. I asked at Template talk:Cite rowlett, and the working ibiblio.org links almost exactly correspond to the old unc.edu/~rowlett links. I'm not sure what to do with archive links. Keep them if they are working? The use of {{Cite rowlett}} would be preferable, where possible, but if not, then the bare links can just be replaced. Thanks — Martin (MSGJ · talk) 21:49, 22 March 2021 (UTC)[reply]

nytimes.com links to All Movie Guide content

Links to https://www.nytimes.com/movies/person/* are dead and report as soft-404s, thus are not picked up by archive bots. There are about 1,300 articles with links in https and about 150 in http. The URLs are to The New York Times, but the content is licensed from All Movie Guide; thus if in a CS1|2 citation it would convert to |work=All Movie Guide and |via=The New York Times. In addition, an archive URL if available, otherwise marked dead. Extra credit: it could try to determine the date and author by scraping the archive page. Example. -- GreenC 18:00, 6 April 2021 (UTC)[reply]
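A minimal sketch of one way to detect this kind of soft-404, assuming the tell is a "200 OK" that lands on a different path than the one requested; the heuristic and the sample ID are assumptions for illustration:

```python
import requests
from urllib.parse import urlparse

def is_soft404(url, timeout=10):
    """Heuristic soft-404 check: the request 'succeeds' (2xx) but lands on
    a different path than requested (e.g. a homepage), so bots that only
    look at status codes never flag the link as dead."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return True  # hard failure: certainly not serving the cited page
    requested = urlparse(url).path.rstrip("/")
    landed = urlparse(resp.url).path.rstrip("/")
    return resp.ok and landed != requested

# hypothetical All Movie Guide person ID
print(is_soft404("https://www.nytimes.com/movies/person/12345"))
```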

Results

  • Edited 11,160 articles
  • Add 14,871 new archive URLs
  • Change metadata in 12,855 citations (eg. |work=)
  • Toggle 704 existing archives with |url-status=live --> |url-status=dead
  • Add 208 {{dead link}}
  • Various other general fixes

-- GreenC 00:25, 15 April 2021 (UTC)[reply]

articles.timesofindia.indiatimes.com links to timesofindia.indiatimes.com

Several years ago all the content on this subdomain was moved to timesofindia.indiatimes.com. However, the links are not the same, there are no redirects, and the new URLs cannot be reconstructed or guessed with any algorithm. One has to search Google for the title of the link with the former domain and update the link with the new domain.

LinkSearch

Old URL - http://articles.timesofindia.indiatimes.com/2001-06-28/pune/27238747_1_lagaan-gadar-ticket-sales (archived)

New URL - https://timesofindia.indiatimes.com/city/pune/film-hungry-fans-lap-up-gadar-lagaan-fare/articleshow/1796672357.cms

Is there a possibility of a WP:SEMIAUTOMATED bot that takes inputs from the user about the new URL and updates WP? Is there an existing bot? If not: I created a small semi-automated script (here) to assist me with the same functionality. Do I need to get approval for this bot, if this is even considered a bot? -- Srihari Thalla (talk) 19:20, 8 April 2021 (UTC)[reply]
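For illustration only, a bare-bones version of such a semi-automated helper using the Pywikibot framework; the workflow (print the line, prompt for a researched URL, save) mirrors the description above, and the string handling is deliberately naive:

```python
import pywikibot

OLD = "articles.timesofindia.indiatimes.com"

def assist(page_title):
    """Print each line containing an old-domain URL, prompt the user for
    the researched replacement, and save the page once at the end."""
    site = pywikibot.Site("en", "wikipedia")
    page = pywikibot.Page(site, page_title)
    text = page.text
    for line in page.text.splitlines():
        if OLD not in line:
            continue
        print(line)
        new_url = input("New URL (blank to skip): ").strip()
        if not new_url:
            continue
        # naive: swap the whitespace-delimited token holding the old URL;
        # real handling must also update |archive-url= and |url-status=,
        # as discussed in the replies below
        old_url = next(tok for tok in line.split() if OLD in tok)
        text = text.replace(old_url, new_url)
    if text != page.text:
        page.text = text
        page.save(summary="Update Times of India URL (semi-automated)")
```

Since every edit is manually reviewed before saving, this is assisted editing rather than a fully automatic bot; whether it needs approval is a WP:BOTPOL question.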

Are you seeing problems with content drift (content at the new page differing from the old)? You'll need to handle existing |archive-url=, |archive-date= and |url-status=, since you can't change |url= and not |archive-url=, which if changed has to be verified as working. There are {{webarchive}} templates that sometimes follow bare and square links and might need to be removed or changed. The |url-status= should be updated from dead to live. There are {{dead link}}s that might need to be added or removed. You should verify the new URL is working, not assume it does; and if there are redirects in the headers, capture those and change the URL to reflect them. Those are the basics for this kind of work; it is not easy. Keep in mind there are 3 basic types of cites: those within a cite template, those in a square link, and those bare. Of those three types, the square and bare may have a trailing {{webarchive}}. All types may have a trailing {{dead link}}.
OR, my bot is done and can do all this. All that would be needed is a map of old and new URLs. There are as many as 20,000 URLs; do you propose manually searching for each one? Perhaps better to leave them unchanged and add archive URLs. For those that have no archive URL (ie. {{dead link}}), manually search to start. I could generate a list of those URLs with {{dead link}} while making sure everything else is archived. -- GreenC 20:24, 8 April 2021 (UTC)[reply]
If you already have the bot ready, then we can start with those that have no archive URL. If you could generate the list, I could also post on WP:INDIA asking for volunteers.
I would suggest doing this work with a semi-automated script, ie. the script would read the page with the list, parse each row, and print it on the terminal (with as many details of the link as possible: full cite, link title, etc.) so that it would be easy for the user to search; once the new URL is found, the script takes the input and saves it to the page. Do you think this would be faster and more convenient?
I would also suggest forming the list using the columns: serial number, link, cite/bare/square link, title (if possible), new URL, new URL status, new archive URL, new archive URL date. The new ones at the end would be left blank, to be filled in once researched. Do these columns look good?
Do you have a link to your bot? -- DaxServer (talk) 07:45, 9 April 2021 (UTC)[reply]
How about I provide you with as much data as possible in a regular parsable format? I'd prefer not to create the final table, as that should be done by the author of the semi-automated script based on its requirements and location. Does that sound OK? The bot page is User:GreenC/WaybackMedic_2.5, however it is 3 years out of date, as is the GitHub repo; there have been many changes since 2018. The main bot is nearly 20k lines, but each URLREQ move request has its own custom module that is smaller. I can post an example skeleton module if you are interested; it is in Nim (programming language), which is similar to Python in syntax. -- GreenC 18:24, 9 April 2021 (UTC)[reply]
Data in a parsable format is a good way to start. Based on this, a suitable workflow can be established over time. The final table can be done later, as you said.
I unfortunately had never heard of Nim. I know a little bit of Python and could have looked at Nim, but I do not have any time until mid-May. Would this be an example of a module: citeaddl? But this is Medic 2.1 and not 2.5. Perhaps you could share the example. If it looks like something I can deal with without much of a learning curve, I can work something out. If not, I will have to wait until the end of May and then evaluate again! -- DaxServer (talk) 20:24, 9 April 2021 (UTC)[reply]

User:GreenC/software/urlchanger-skeleton-easy.nim is a generic skeleton source file, to give a sense of what is involved. It only needs some variables modified at the top defining the old and new domains. There is a "hard" skeleton for more custom needs, where mods are made throughout the file, for when the easy version is not enough. The file is part of the main bot, isolating domain-specific changes to this one file. I'll start on the above; it will probably take a few days, depending on how many URLs are found. -- GreenC 01:42, 11 April 2021 (UTC)[reply]

@DaxServer: The bot finished. Cites with {{dead link}} are recorded at Wikipedia:Link rot/cases/Times of India (raw) about 150. -- GreenC 20:57, 16 April 2021 (UTC)[reply]

Good to hear! Thanks @GreenC -- DaxServer (talk) 11:16, 17 April 2021 (UTC)[reply]

Results

  • Edits to 9,509 articles
  • New archive URLs added 15,269
  • Toggled 1,167 |url-status=live to |url-status=dead
  • {{dead link}} added about 100
  • 11,941 cites changed metadata (eg. normalized |work=, removed "Times of India" from |title=)

odiseos.net is now a gambling website

There were two references to this website. I have removed one. The archived url has the content. Should this citation be preserved or removed?

My edit and existing citation -- DaxServer (talk) 07:50, 9 April 2021 (UTC)[reply]

This is a usurped domain. Normally these would be changed to |url-status=usurped. The talk page instance was removed because the "External links modified" section can be removed; it is an old system no longer used. I'll need to update the InternetArchiveBot database to indicate this domain should be blacklisted, but the service is currently down for maintenance. https://iabot.toolforge.org/ -- GreenC 17:10, 9 April 2021 (UTC)[reply]
I have also reverted my edit to include the |url-status=usurped (new edit). -- DaxServer (talk) 20:33, 9 April 2021 (UTC)[reply]

Migrate old URLs of "thehindu.com"

Old URLs from some time before 2010 have a different URL structure. The content has moved to a new URL, but a direct redirect is not available. The old URL is redirected to a list page categorized by the date the article was published. One has to search for the title of the article and follow the link. Surprisingly, some archived URLs I tested were redirected to the new archived URL. My guess is that the redirection worked in the past, but was broken at some point.

Old URL - http://hindu.com/2001/09/06/stories/0406201n.htm (archived in 2020 - automatically redirected to the new archived url; old archive from 2013)

Redirected to list page - https://www.thehindu.com/archive/print/2001/09/06/

Title - IT giant bowled over by Naidu

New URL from the list page - https://www.thehindu.com/todays-paper/tp-miscellaneous/tp-others/it-giant-bowled-over-by-naidu/article27975551.ece

There is no content drift between the old URL (2013 archive) and the new URL.

Example from N. Chandrababu Naidu. PS: This citation is used twice (as found by searching the title), once with the old URL and once with the new URL. -- DaxServer (talk) 14:18, 9 April 2021 (UTC)[reply]

The new URL [8] is behind a paywall and unreadable, while the archive of the old URL [9] is fully readable. I think it would be preferable to maintain archives of the old URLs, since they are not paywalled and there would be no content drift concern. Perhaps, similar to the above, attempt a migration for a soft-404 that redirects to a list page only when no archive is available. -- GreenC 17:37, 9 April 2021 (UTC)[reply]
In that case, perhaps WaybackMedic or the IA bot can add archived URLs to all these links? If you want to be more specific, here is the regex of the URLs that I have found so far. There can be others which I have not encountered yet.
https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?\d{4}\/[01]\d\/[0-3][0-9]\/stories\/[0-9a-z]+\.htm
-- DaxServer (talk) 20:39, 9 April 2021 (UTC)[reply]
Can you verify the regex? I don't think it would match the above "Old URL" in the segment \d{4}\/[01]\d\/[0-3][0-9]\/ .. maybe it is a different URL variation? -- GreenC 21:52, 9 April 2021 (UTC)[reply]
It matches. I checked it on regex101 and also on the Python CLI. Maybe this helps; here is a simpler regex.
https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?\d{4}\/\d{2}\/\d{2}\/stories\/[0-9a-z]+\.htm -- DaxServer (talk) 12:02, 10 April 2021 (UTC)[reply]
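For the record, the check is easy to reproduce; a quick Python session confirms the original regex does match the example old URL:

```python
import re

OLD_URL = "http://hindu.com/2001/09/06/stories/0406201n.htm"

pattern = re.compile(
    r"https?\:\/\/(www\.)?(the)?hindu\.com\/(thehindu\/(fr|yw|mp|pp|mag)\/)?"
    r"\d{4}\/[01]\d\/[0-3][0-9]\/stories\/[0-9a-z]+\.htm"
)

# "2001" -> \d{4}, "09" -> [01]\d, "06" -> [0-3][0-9], "0406201n" -> [0-9a-z]+
print(bool(pattern.match(OLD_URL)))  # True
```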
Ahh, got it; sorry, I misread it. Thanks. -- GreenC 13:33, 10 April 2021 (UTC)[reply]
Regex modified to work with Elasticsearch insource: and to catch some additional matches; 12,229 results:
insource:/\/{2}(www[.])?(the)?hindu[.]com\/(thehindu\/)?((cp|edu|fr|lf|lr|mag|mp|ms|op|pp|seta|yw)\/)?[0-9]{4}\/[0-9]{2}\/[0-9]{2}\/stories\/[^.]+[.]html?/
-- GreenC 04:27, 17 April 2021 (UTC)[reply]

DaxServer, the Hindu is done. Dead link list: Wikipedia:Link rot/cases/The Hindu (raw). -- GreenC 13:24, 23 April 2021 (UTC)[reply]

Great work @GreenC !! -- DaxServer (talk) 16:58, 23 April 2021 (UTC)[reply]
  • Edits to 11,985 articles
  • New archive URLs added 15,954
  • Toggled 2,412 |url-status=live to dead
  • Added 1,244 {{dead link}}
  • 12,234 cites changed metadata (eg. normalized |work=, removed "The Hindu" from |title=, etc.)
  • Updated the IABot database, each link individually set blacklisted.

sify.com

Any link that redirects to the home page. Example. Example.-- GreenC 14:27, 17 April 2021 (UTC)[reply]

Results

  • Add 4,132 new archive links (Example)
  • Add or modify 1,149 |url-status=dead (Example)
  • Set links "Blacklisted" in IABot database

ancient.eu

Ancient History Encyclopedia has rebranded to World History Encyclopedia and moved domain to worldhistory.org. There are many references to the site across Wikipedia. All references pointing to ancient.eu should instead point to worldhistory.org. Otherwise the URL structure is the same (ie. https://www.ancient.eu/Rome/ is now https://www.worldhistory.org/Rome/). — Preceding unsigned comment added by Thamis (talkcontribs)

Hi @Thamis:, thanks for the lead/info; this is certainly possible to do. Do you think there is reason to consider content drift, ie. is the page at the new site different from the original (in substance), or largely a 1:1 copy of the core content? Comparing this page with this page, it looks like this is an administrative change and not a content change. -- GreenC 23:40, 20 April 2021 (UTC)[reply]
Thanks for looking into this, @GreenC:. There's no content drift, it's a 1:1 copy of the content with the exact same URLs (just the domain is different). When I compare the two Rome pages from the archive and the new domain that you linked, I see the exact same page. The same is true for any other page you might want to check. :-)

@Thamis:, this URL works but this URL does not. The etc.ancient.eu sub-domain did not transfer, but still works at the old site. The bot will skip these, as the links still work and I don't want to add an archive URL to live links if they will be transferred to worldhistory.org in the future. Can be revisited later. -- GreenC 16:03, 23 April 2021 (UTC)[reply]

@GreenC: Indeed, that etc.ancient.eu subdomain was not transferred. It's the www.ancient.eu domain that turned into www.worldhistory.org -- subdomains other than "www" should be ignored.

@Thamis: it is done. In addition to the URLs, it also changed/added |work=, etc. to World History Encyclopedia. It got about 90%, but the string "Ancient History Encyclopedia" still exists in 89 pages/cites; they will require manual work to convert (the URLs are converted, only the string is not). They are mostly free-form cites with unusual formatting and would benefit from manual cleanup, ideally conversion to {{cite encyclopedia}}. -- GreenC 01:07, 24 April 2021 (UTC)[reply]

Results

  • Edited 759 articles
  • Converted 917 URLs (Example)

*.in.com

Everything is dead. Some links redirect to a new domain homepage unrelated to the previous site. Some have 2-level-deep sub-domains. All are now set to "Blacklisted" in IABot for global wiki use; a Medic pass through enwiki will also help. -- GreenC 04:13, 25 April 2021 (UTC)[reply]

Results

  • Edited 3,803 articles
  • Added 3,863 new archive URLs
  • Changed/added 732 |url-status=dead to existing archive URLs
  • Added 104 {{dead link}}
  • Set individual links "Blacklisted" in IABot database

Remove oxfordjournals.org

Hello, I think all links to oxfordjournals.org subdomains in the url parameter of {{cite journal}} should be removed, as long as there's at least a doi, pmid, pmc, or hdl parameter set. Those links are all broken, because they redirect to an HTTPS version which uses a certificate valid only for silverchair.com (example: http://jah.oxfordjournals.org/content/99/1/24.full.pdf ).

The DOI redirects to the real target URL, which nowadays is somewhere in academic.oup.com, so there's no point in keeping or adding archived URLs or url-status parameters. These URLs have been broken for years already, so it's likely they will never be fixed. Nemo 07:13, 25 April 2021 (UTC)[reply]
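A minimal sketch of such a removal pass using the mwparserfromhell library, keeping |url= only when no stable identifier is present; this is an illustration of the proposed rule, not Citation bot's or any existing bot's code:

```python
import mwparserfromhell

STABLE_IDS = ("doi", "pmid", "pmc", "hdl")

def strip_oxford_urls(wikitext):
    """Remove |url= from {{cite journal}} when it points at an
    oxfordjournals.org subdomain and the cite carries at least one
    stable identifier (doi/pmid/pmc/hdl), per the request above."""
    code = mwparserfromhell.parse(wikitext)
    for tpl in code.filter_templates():
        if not tpl.name.matches("cite journal"):
            continue
        if not tpl.has("url"):
            continue
        if "oxfordjournals.org" not in str(tpl.get("url").value):
            continue
        if any(tpl.has(p) for p in STABLE_IDS):
            tpl.remove("url")
    return str(code)
```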

About 15,000. I have been admonished for removing archive URLs because of content drift, ie. the page at the time of citation contains different content than the current one (academic.oup.com); the archive URL is therefore useful for showing the page at the time of citation, for verification purposes. OTOH, if there is reason to believe content drift is not a concern for a particular domain, that is not my call to make; someone else would need to do that research and determine whether this should be of concern. @Nemo bis: -- GreenC 16:03, 25 April 2021 (UTC)[reply]
The "version of record" is the same, so the PDF at the new website should be identical to the old one. The PubMed Central copy is generally provided by the publisher, too. So the DOI and PMC ID, if present, eliminate any risk of content drift. On the other hand, I'm pretty sure whoever added those URLs didn't mean to cite a TLS error page. :) Nemo 18:21, 25 April 2021 (UTC)[reply]
I can do this; it will just need some time, thanks. -- GreenC

Fix pdfs.semanticscholar.org links

The pdfs.semanticscholar.org URLs which HTTP 301 redirect to www.semanticscholar.org are actually dead links. There are quite a few now. A link to the Wayback Machine is possible, but I believe InternetArchiveBot would not normally add it. Nemo 21:15, 28 April 2021 (UTC)[reply]

They are soft 404s in the sense that the landing page is 200 and serves related content, but not what is expected from the original URL (ie. a PDF). We can restore the PDF via the WaybackMachine and other archive providers as archive URLs. Being 404-ish links, they should be saved as originally intended, for WP:V purposes. If the citation already has an archive link it will be skipped. If no archive link can be found, it will leave the URL in place and let Citation bot handle it; I can generate a list of these, and there probably will not be many. -- GreenC 21:29, 28 April 2021 (UTC)[reply]
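A sketch of how the dead ones can be recognized, per the 301-to-www.semanticscholar.org pattern described above; the function name is illustrative:

```python
import requests

def is_dead_pdf_link(url, timeout=10):
    """pdfs.semanticscholar.org links that 301-redirect over to
    www.semanticscholar.org no longer serve the cited PDF, even though
    the landing page answers 200."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return True
    redirected = any(r.status_code == 301 for r in resp.history)
    return redirected and "www.semanticscholar.org" in resp.url
```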
Makes sense, thank you! Nemo 06:42, 29 April 2021 (UTC)[reply]
Nemo, testing is going well and it's about ready for the full run. There are a number of edge-case types that required special handling, so it's a good thing this is custom. Question: do you know whether, with this diff, Citation bot would keep the archive URL or remove it? -- GreenC 16:51, 29 April 2021 (UTC)[reply]
Those diffs look good. As far as I know, at the moment Citation bot is not removing those URLs; I've tested on a few articles after your bot's edits and they were left alone. Nemo 04:38, 30 April 2021 (UTC)[reply]

Nemo, looks done, let me know if you see any problems. -- GreenC 16:43, 30 April 2021 (UTC)[reply]

Thank you! Wikipedia:Link rot/cases/pdfs.semanticscholar.org is super useful. I noticed that OAbot can find more URLs to add when a DOI is available and the URL parameter is cleared. So I think I'll do another pass with OAbot by telling it to ignore the SemanticScholar URLs, and then I'll manually remove the redundant ones. Nemo 20:48, 1 May 2021 (UTC)[reply]

Results

  • Edited 2,754 articles
  • Added 3,204 new archive URLs for pdfs.semanticscholar.org
  • Add/change 74 |url-status=dead in preexisting archive URLs
  • 485 URLs no archives found: Wikipedia:Link rot/cases/pdfs.semanticscholar.org
  • Updated the IABot database: blacklisted the above archived URLs while retaining the whitelist for remaining URLs in the domain.

TracesOfWar citations update

Wikipedia currently contains citations and source references to the websites TracesOfWar.com and .nl (EN-NL bilingual), but also to the former websites ww2awards.com, go2war2.nl and oorlogsmusea.nl. However, these websites have been integrated into TracesOfWar in recent years, so the source references are now incorrect on hundreds of pages, and a multiple of that in terms of individual references. Fortunately, ww2awards and go2war2 currently still redirect to the correct page on TracesOfWar, but this is no longer the case for oorlogsmusea.nl. I have been able to correct all the sources for oorlogsmusea.nl manually. For ww2awards and go2war2 the redirects will stop in the short term, which will result in thousands of dead links, even though they can be properly directed towards the same source. A short example: person Llewellyn Chilson (at TracesOfWar persons id 35010) now has a source reference to http://en.ww2awards.com/person/35010, but this must be https://www.tracesofwar.com/persons/35010/. In short, old format to new format in terms of URL, but the same ID.

In my opinion, that should make it possible to convert everything with the format 'http://en.ww2awards.com/person/[id]' (old English) or 'http://nl.ww2awards.com/person/[id]' (old Dutch) to 'https://www.tracesofwar.com/persons/[id]' (new English) or 'https://www.tracesofwar.nl/persons/[id]' (new Dutch) respectively. The same applies to go2war2.nl, but with a slightly different format: http://www.go2war2.nl/artikel/[id] becomes https://www.tracesofwar.nl/articles/[id]. The same has already been done on the Dutch Wikipedia, via a similar bot request. Lennard87 (talk) 18:50, 29 April 2021 (UTC)[reply]
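A minimal sketch of the old-to-new mapping described above; the regexes assume the IDs are numeric, which goes slightly beyond the examples given:

```python
import re

# old format -> new format, same ID (IDs assumed numeric)
RULES = [
    (re.compile(r"http://en\.ww2awards\.com/person/(\d+)"),
     r"https://www.tracesofwar.com/persons/\1/"),
    (re.compile(r"http://nl\.ww2awards\.com/person/(\d+)"),
     r"https://www.tracesofwar.nl/persons/\1/"),
    (re.compile(r"http://www\.go2war2\.nl/artikel/(\d+)"),
     r"https://www.tracesofwar.nl/articles/\1"),
]

def migrate(url):
    """Rewrite a known old-format URL to its TracesOfWar equivalent."""
    for pattern, repl in RULES:
        if pattern.match(url):
            return pattern.sub(repl, url)
    return url  # unrecognized: leave unchanged

print(migrate("http://en.ww2awards.com/person/35010"))
# -> https://www.tracesofwar.com/persons/35010/
```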

Reuters

The new Reuters website redirected all subdomains to www.reuters.com and broke all links. That's about 50k articles on the English Wikipedia alone, I believe. I see that the domain is whitelisted on InternetArchiveBot, not sure whether that's intended. Nemo 20:13, 1 May 2021 (UTC)[reply]