Wikipedia talk:WikiProject External links/Webcitebot2



Summary as of 03:36, 2 March 2011 (UTC)
Wikipedia has 17.5 million links to external websites (see list 844 MB download). I estimate that an additional 3,700 links are added each day (see link feed at #wikipedia-en-spam). The links are often used in citations and should be archived before they go dead. Around 110,000 articles are now tagged for dead external links, and this number has been steadily increasing for the last two years (graph). Several solutions have been proposed and are summarized below.
| Option | Status | S | O | N | Notes |
|---|---|---|---|---|---|
| Wikimedia Foundation starts its own archiving project | Proposed | 7 | 4 | 0 | ♦ A larger, similar project is proposed at meta:WikiScholar |
| Links are archived at WebCite and replaced by a bot | Ready now | 9 | 1 | 0 | ♦ WebCite can whitelist bots for full speed operation |
| | | | | | Δ has a functioning WebCiteBot here and is in contact with WebCite |
| | | | | | Tim1357 has a WebCiteBot almost complete and is in contact with WebCite |
| | | | | | Nn123645 has a WebCiteBot in development |
| | | | | | ThaddeusB is working to get the original WebCiteBOT running again |
| Links are archived at Archive it | Under discussion | 3 | 0 | 1 | ♦ Internet Archive is ready to help and wants to work out details |
| | | | | | Gwern and Hydroxonium are in email contact with them |
| Links are archived at Wikiwix and a sitewide script adds archive links to every external link (example) | Ready, but not yet enabled (needs community approval first) | 6 | 0 | 0 | Original RfC is closed. New RfC for small scale test is under discussion here |
| | | | | | ♦ Wikiwix has been in use on fr.wikipedia for about 2 years |
| | | | | | ♦ Add fr:Utilisateur:Pmartin/cache.js to your vector.js to use/test |
| | | | | | ♦ Add fr:Utilisateur:Pmartin/cache.js to MediaWiki:Common.js for everybody (requires consensus from the community) |
| | | | | | ♦ JavaScript required to use this tool |
| | | | | | ♦ Pmartin will provide backups of archived webpages |

I have received updated information, so I decided to add a summary. Everybody is welcome to update this list. - Hydroxonium (H3O+) 18:26, 11 February 2011 (UTC)

The summary is routinely being updated by several users. - Hydroxonium (H3O+) 22:45, 14 February 2011 (UTC)
Added vote count (S)upport (O)ppose (N)eutral. - Hydroxonium (H3O+) 13:44, 17 February 2011 (UTC)
I've updated the summary. - Hydroxonium (H3O+) 03:36, 2 March 2011 (UTC)


How are sister projects started? We have Wikinews, Wikiquote, Wikibooks, etc. How about Wiki-citation? - Hydroxonium (H3O+) 18:03, 4 February 2011 (UTC)

That sounds good and seems to be such a logical step that I wonder why it has not been implemented yet. Would the costs of additional "archive servers" raise the Wikimedia Foundation's expenses that dramatically? If not, one idea would be to just provide a tool (for example in the WP:Toolbox) that can perform the archiving. Maybe it would be possible to use some of WebCite's archive software and algorithms? I don't know if they are patented or something. Toshio Yamaguchi (talk) 18:51, 4 February 2011 (UTC)
No patents, and WebCite is open source. Yes, in principle this could be rebranded as Wiki-citation, but there are all sorts of legal threats and operational issues (dealing with user support and DMCA/takedown requests) associated with running such a service, which is why I think it makes sense to keep WebCite running as a stand-alone entity, and to aim for a memorandum of understanding between WMF and WebCite. --Eysen (talk) 15:42, 8 February 2011 (UTC)
I believe that you write a proposal at meta:Proposals for new projects, and hope people are excited about it. However, I'm not convinced this would fly with the foundation. "Wiki-citation" is basically "Wiki-nothing-but-copyright-violations". Such a thing might be fair use, but that is not at all clear under current laws and treaties, and the only way to clear that up is very expensive. The foundation would have to be willing to risk the possibility of a major legal challenge. I simply don't think that they will want to undertake that risk. WhatamIdoing (talk) 20:12, 8 February 2011 (UTC)
It's no more unclear than Wikipedia itself or copyleft is. The Internet Archive and WebCitation and Google aren't exactly trying to fly under the radar. --Gwern (contribs) 20:25 8 February 2011 (GMT)
Yes, but they have much larger legal teams than we do here. Doc James (talk · contribs · email) 13:44, 9 February 2011 (UTC)
No, Google is a completely different situation: for one thing, it's temporary, and for another, the point is to drive traffic to your website ('to help you make money'). You can opt out, but basically no one does (for their whole website, at any rate: many people block bots from some sections).
Copying the material to our own website would have the opposite effect. Say you write an e-book, and you post a chapter online, along with a 'buy my book now' button. I copy the chapter and put it on my website (clear copyright violation) without any 'buy his book now' button. As a result of our copying your work, you get less traffic, less ad revenue, and fewer book sales. The commercial harm to you weighs heavily in fair use determinations. The argument that 'well, maybe some day he might choose to take the website down, and then I wouldn't know what it said' weighs very little at all, especially before the website is actually taken down.
It also creates control problems: If you decide to take the page down—possibly because you actually no longer want anyone to have the book*—your work is still available to the world (and not just until Google updates its cache).
(*True story: I knew a university professor who photocopied an entire, rather rare book every year. The copyright owner had fallen out with the author, had doubts about the book's accuracy, and absolutely refused to re-print the book at any price. There was apparently (at the time) a provision in US copyright law that exempted schools from liability if the violation was done more or less accidentally (i.e., the professor was so absent-minded that he didn't realize it was a copyright violation). So every time this course came up, he announced to the department secretary that he was 'spontaneously and without thinking it through' making a dozen copies of the entire book.)
IMO what protects the two archives is that they have almost no assets. This is not true of the WMF: they own one of the biggest names on the web.
And copyleft is irrelevant: That's the voluntary choice of the copyright owner, not the copyright violator. If you have the copyright, I can't declare that it's available under copyleft and make a bunch of copies just because I want to. WhatamIdoing (talk) 18:22, 23 February 2011 (UTC)

"Single point of failure" concerns

The Wikiwix solution appears to be the easiest solution, as somebody else does the work for us. It has also been reliable for over 2 years on fr.wikipedia. But I have a concern that this is still a single point of failure, which is how we got into this situation when the original WebCiteBOT went down. I would like to see at least 2 solutions implemented so that we don't come crashing down if something happens to one of them. I would like to get everybody's input on this. Thanks. - Hydroxonium (H3O+) 18:03, 10 February 2011 (UTC)

I agree, redundancy here definitely won't hurt. Unlike some other things, in many cases there is no way to recover from data loss. If something is only published on the web and it goes down you may not be able to replace it. This is especially problematic if it is the only source for a particular statement in an article. --nn123645 (talk) 19:42, 10 February 2011 (UTC)
I'll make sure to ask more details about it. But it is certainly a common concern that Pmartin should know well, as a search engine provider. If I recall correctly, Pmartin did offer to provide a backup of the data in case they would stop their archive service - which is not supposed to happen in the years to come, but he offered it as a safety precaution. But this is certainly not enough. So Pmartin may agree to provide regular backups to the WMF, I have yet to discuss this issue with him. Yours, Dodoïste (talk) 21:41, 11 February 2011 (UTC)
Update: Pmartin agreed to provide regular backups of the Wikiwix archive to the WMF, or to a trusted member of the Wikimedia community. Thus, the single point of failure is no longer a concern. Cheers, Dodoïste (talk) 21:10, 14 February 2011 (UTC)
The Wikiwix solution is not only the easiest solution, it's also the cheapest. By using one of the other proposed solutions as a backup, we'll be spending a lot of time and potentially a significant amount of money on something we may never need to use and, in the case of WebCitation, may be less reliable than our primary method. Data dumps of the Wikiwix data are probably a better solution than running 2 systems in parallel. While linkrot is a problem, I don't think it's such a massive problem that it would justify spending weeks of development time and/or thousands of dollars just on a backup. Mr.Z-man 22:14, 11 February 2011 (UTC)
17 million links. Most of which will be dead within a decade or two. That's not worth a few thousand dollars? Maybe you should look at the Foundation reports because it seems to me that it spends a lot more on things a lot less important. --Gwern (contribs) 23:42 11 February 2011 (GMT)
A few thousand dollars for the primary system maybe, but for the backup? Mr.Z-man 00:35, 12 February 2011 (UTC)
A backup is mainly just a function of how much disk space is available. Hard drive space is approaching $0.03 per GB (on google shopping I just found a 2 TB 5600 RPM 300 MB/s drive for $80 here ). Considering your average web page is around 300 KB with images that's not really that much. Remember we aren't talking about archiving the entire web, just a very small portion of it. --nn123645 (talk) 15:10, 14 February 2011 (UTC)
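The estimate above can be checked with some rough arithmetic. This is only a minimal sketch, assuming the figures quoted on this page (17,462,350 links, about 300 KB per page, a 2 TB drive for $80) and decimal units; the function names are hypothetical.

```python
# Rough storage estimate using figures quoted in this discussion.
LINKS = 17_462_350          # external links reported from the toolserver query
AVG_PAGE_KB = 300           # assumed average page size with images
DRIVE_PRICE_USD = 80        # quoted price of one drive
DRIVE_SIZE_GB = 2_000       # 2 TB, decimal units (assumption)


def total_storage_gb(links: int, avg_page_kb: float) -> float:
    """Total size of one snapshot per link, in decimal gigabytes."""
    return links * avg_page_kb / 1_000_000


def disk_cost_usd(storage_gb: float, price_usd: float, size_gb: float) -> float:
    """Raw disk cost at the quoted per-drive price (no redundancy, no servers)."""
    return storage_gb * price_usd / size_gb


storage = total_storage_gb(LINKS, AVG_PAGE_KB)
cost = disk_cost_usd(storage, DRIVE_PRICE_USD, DRIVE_SIZE_GB)
print(f"{storage:,.0f} GB, about ${cost:,.0f} of raw disk")
```

Under these assumptions, one copy of every linked page comes to roughly 5.2 TB, or about $210 of raw disk. The figure covers disk space only; it leaves out bandwidth, servers, and operations.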
If we're just saving a copy of the Wikiwix data (which is what I think we should do), it wouldn't cost much. But if we're developing our own system, paying for Internet Archive's service, or funding WebCitation (which is what some people seem to be advocating), it will be significantly more. Hard drive space would be the least of the costs. Mr.Z-man 21:46, 14 February 2011 (UTC)
  • Failover backup web and database servers in case of a single point of failure are standard and easy. Whichever method is used, we could probably negotiate such a service if we needed it. Multiple archiving is probably best in the long term, though. FT2 (Talk | email) 02:57, 17 February 2011 (UTC)
  • There's no real reason to make choices between archives if more than one is willing to take the data. It's well understood that "Lots Of Copies Keeps Stuff Safe" (LOCKSS), whether against changes of local laws, physical loss of the archive, or simple loss of the will to fund and operate them. Archives already recognize this principle with mirroring arrangements. We ideally should avoid using a single archive. Embracing this idea has other advantages: we can divide up the initial workload of archiving and indexing, thus getting more rapidly to the step where at least one archive covers each URL we use. Then we can move on to getting redundancy (for those archives that haven't already addressed that).
  • Question re the scope behind the 17 million figure: is this the number of unique external URLs appearing in references within articlespace of English-language Wikipedia? LeadSongDog come howl! 06:01, 20 February 2011 (UTC)
This is my understanding, but I may be wrong.
  • They are mostly unique, although I'm sure there are duplications.
  • They are all external links, both in citation templates and otherwise.
  • Only from English Wikipedia.
  • Only from article space.
nn123645 had EdoDodo run the request on the toolserver and came up with 17,462,350 links. The 844 MB (compressed) file can be downloaded here. - Hydroxonium (H3O+) 14:31, 20 February 2011 (UTC)
That is correct. The query to get that number took over an hour to run. I don't think it would be practical to try to find the number of unique links in references via a single query (references would require querying page text, something you can't directly do via SQL on the toolserver). Basically that is run off the external links table, which adds the link once to the table for every page it is used on. So if a link is used 30 times on 20 pages, it would appear in the external links table 20 times. The only practical way to get that number would be to go through and parse every page for it, or, using the magic of statistics, a portion of the pages. --nn123645 (talk) 16:47, 24 February 2011 (UTC)
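To illustrate the counting behaviour described above, here is a toy in-memory model. It is not the toolserver schema or the actual query; the page names, URL, and function names are made up. It shows only why the row count, the number of link uses, and the number of unique URLs are three different figures.

```python
# Toy model of an external-links table: one row per (page, URL) pair,
# no matter how many times the URL is repeated on that page.
from typing import Iterable


def link_table_rows(uses: Iterable[tuple[str, str]]) -> set[tuple[str, str]]:
    """Collapse individual link uses into distinct (page, url) rows."""
    return set(uses)


def unique_urls(rows: Iterable[tuple[str, str]]) -> set[str]:
    """Distinct URLs across all pages."""
    return {url for _page, url in rows}


# Hypothetical data: one URL used 30 times, spread over 20 pages.
uses = [(f"Page {i % 20}", "http://example.org/source") for i in range(30)]
rows = link_table_rows(uses)

print(len(uses), len(rows), len(unique_urls(rows)))  # 30 20 1
```

The 17,462,350 figure corresponds to the middle number: page-link pairs, not individual uses and not unique URLs.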
Thank you, that is helpful, particularly in ruling out template space and talkspaces and in that it gives us a mean of about five article-EL pairs per article. It might be instructive to run a similar SQL query to return results grouped by article class. Our greatest effort goes into FA- and GA-class articles. As they are much the smallest portion of our ~3.5 million articles I'd suggest these should be at the head of any processing queue. It would much reduce the risk of losing good work while we get a better handle on the balance of the problem. LeadSongDog come howl! 18:20, 24 February 2011 (UTC)
As ever, we should put a premium on getting a solution - any solution - working as fast as possible. If prioritizing FAs or GAs (not that the latter are especially good) will slow us down by even a day, then it's a terrible no-good 'letting the perfect be the enemy of the better' idea. Every day we wait, links are expiring! --Gwern (contribs) 19:07 24 February 2011 (GMT)
I'd agree if it was possible to just turn a key and have all 17 million done. But we have no evidence that is the case. It is going to take a while to crunch through these, especially at the snail's pace that presently permits. I'd rather that the crap articles are at risk until the gems are secured. LeadSongDog come howl! 19:42, 24 February 2011 (UTC)

Alternative solution

There's some opposition to the Wikiwix RfC. The reason an RfC is needed for the proposed Wikiwix solution is that it requires modifying the Wikimedia interface. I think an easier and less controversial option is to modify the {{Citation/core}} template. This would change all the major citation templates, such as {{citation}}, {{cite news}}, {{cite web}}, etc. This could be modified in a few days after some thorough testing.

The 4 major caching systems (Wikiwix, WebCite, Internet Archive and Google's cache) can all be used by adding a prefix to a webpage's URL. These would be in the form of:


This could show up in a citation looking something like this.

  1. Doe, John (January 1, 2010), "title of some citation", New York Times, archived at Wikiwix, WebCite, Internet Archive, Google

This would provide links to cached versions of webpages on 4 different services, and so should please most people. Obviously, Wikiwix, WebCite, Internet Archive and Google would still have to archive the webpages in the first place, otherwise they would come up 404. Also, the link for Internet Archive will come up with a list of all of their cached pages instead of a specific cached page, but this may be acceptable.

I believe this would be easier to implement and a lot less controversial. Anybody have any opinions on this? Thanks. - Hydroxonium (H3O+) 01:22, 15 February 2011 (UTC)
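As a rough illustration of the prefix idea in this proposal, the sketch below builds one lookup link per service. The prefix strings are assumptions, recalled from how these services were commonly addressed around this time. They are not taken from this page and should be verified against each service. `ARCHIVE_PREFIXES` and `archive_links` are hypothetical names.

```python
from urllib.parse import quote

# Assumed lookup prefixes (not from this discussion; verify before relying on them).
ARCHIVE_PREFIXES = {
    "Wikiwix": "http://archive.wikiwix.com/cache/?url=",
    "WebCite": "http://www.webcitation.org/query?url=",
    "Internet Archive": "http://web.archive.org/web/*/",
    "Google": "http://webcache.googleusercontent.com/search?q=cache:",
}


def archive_links(url: str) -> dict[str, str]:
    """Return one candidate archive link per service for the given URL."""
    links = {}
    for name, prefix in ARCHIVE_PREFIXES.items():
        # Prefixes that carry the URL inside a query string need it
        # percent-encoded; the path-style prefix takes the URL as-is.
        target = quote(url, safe="") if "?" in prefix else url
        links[name] = prefix + target
    return links


for name, link in archive_links("http://example.org/page?id=1").items():
    print(f"{name}: {link}")
```

As the proposal itself notes, such links only resolve if the service has already archived the page, and the Internet Archive form would return a list of captures rather than one specific copy.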

Bad idea. Only link to it if such archives exist. I've got working code for WebCitation and am working with them on whitelisting my IP. Once that is done I can kick this into full gear. ΔT The only constant 01:25, 15 February 2011 (UTC)
I think linking to a page that will 404 is better than not linking to it at all, but instead of saying "archived at" we should probably say something like "archives may be available at". We could always include a parameter to disable this functionality or to have it list only those for which archives exist, but I think implementing something like this would dramatically increase the adoption of people manually archiving stuff with webcite. Another alternative would be linking to a page on the toolserver with a list of archival services, similar to how we do with {{coord}}. --nn123645 (talk) 03:14, 15 February 2011 (UTC)
I think a better idea would be to use something like the Wikiwix script, linking to a toolserver page, or doing some fancy pop-up box with the links. 1) It wouldn't be making one of the biggest, slowest templates even more complex, 2) it wouldn't clutter up the page as much (imagine adding 4 extra links for each ref on an article with 100 references), and 3) it could be used for all links; given that Wikiwix, Google, and IA archive basically everything, there's no reason to limit ourselves to those specifically used in citation templates, especially since citation templates aren't required for references. Mr.Z-man 03:23, 15 February 2011 (UTC)
Well I definitely agree that would probably be a better solution, but as it stands right now consensus seems to be in favour of no implementation. If this is a compromise that can get the approval of the community it would be a hell of a lot better than nothing. --nn123645 (talk) 03:34, 15 February 2011 (UTC)
Can we at least not present it as "easier to implement" when it is in fact significantly more work? What I just suggested is functionally identical to what Hydroxonium proposed except it uses a script instead of modifying a template. Citation/core is a fully protected template used on more than a million pages; there isn't a substantial difference between editing a template like that and using a script (except that a script will be better, performance-wise). Mr.Z-man 03:52, 15 February 2011 (UTC)
Again I fully agree and strongly support a script, but it really comes down to what the community is/isn't willing to support. If the only way this is going to get approved is if a bot goes through and makes a few million or so pointless edits just because that is what people are comfortable with, then that is still better than nothing at all. I'm just going on the basis that we have 40% support to 60% oppose in the RFC right now. Though I think it's too early to say what the outcome will be as there has not been enough outside input. --nn123645 (talk) 04:41, 15 February 2011 (UTC)

Avoid adding these extra words to every cite. An easier solution is a single extra link or icon "Archived..." which when hovered (or clicked) produces a small popup with a list of archives for that link. Not sure about the javascript aspect. Is that really desirable? FT2 (Talk | email) 02:57, 17 February 2011 (UTC)

I don't understand why there must be a core javascript change for Wikiwix to become usable. Why not simply run a bot that adds the archiveurl and archivedate parameters to citations that have been archived? There's no reason why WebCite and Wikiwix cannot both be used...Wikiwix can eventually supplant everything as a primary archival service since archiving is automatic, but there's no reason to not continue to do secondary archiving with WebCite, and of course use Internet Archive for old copies of websites. They can exist side by side, providing redundancy that is badly needed. Huntster (t @ c) 12:21, 18 February 2011 (UTC)
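The bot approach suggested above could look roughly like the following. It is a minimal sketch, not any existing bot's code, and it assumes simple {{cite web}} templates with no nested templates and a `url=` parameter. The function name, the `archives` mapping, and the archive URL in the usage example are hypothetical.

```python
import re

# Matches a simple, non-nested {{cite web ...}} template (assumption: no
# templates nested inside the citation).
CITE_WEB = re.compile(r"\{\{\s*cite web\s*\|[^{}]*\}\}", re.IGNORECASE)


def add_archive_params(wikitext: str, archives: dict[str, tuple[str, str]]) -> str:
    """Add archiveurl/archivedate to cite web templates whose url is in `archives`.

    `archives` maps an original URL to (archive_url, archive_date).
    """

    def patch(match: re.Match) -> str:
        template = match.group(0)
        if re.search(r"\|\s*archiveurl\s*=", template):
            return template  # already has an archive link; leave untouched
        url_match = re.search(r"\|\s*url\s*=\s*([^|\s}]+)", template)
        if not url_match or url_match.group(1) not in archives:
            return template  # no known archive for this citation
        archive_url, archive_date = archives[url_match.group(1)]
        body = template[:-2].rstrip()  # drop the closing braces
        return body + f" |archiveurl={archive_url} |archivedate={archive_date}" + "}}"

    return CITE_WEB.sub(patch, wikitext)


text = "<ref>{{cite web |url=http://example.org/a |title=Example}}</ref>"
known = {"http://example.org/a": ("http://www.webcitation.org/abc123", "2011-02-18")}
print(add_archive_params(text, known))
```

Running the function a second time on its own output changes nothing, because citations that already carry `archiveurl` are skipped. Citations that do not use templates would need separate handling.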

Using the template parameters makes sense if we're only archiving a few things, but if we're going to archive everything, why make millions of edits when we can do it in one? A script also doesn't exclude references that don't use the templates. Mr.Z-man 12:32, 18 February 2011 (UTC)
The bot would be capable of dealing with citations that don't use the templates. That's not an issue. It just doesn't seem like any consensus will emerge to change the core script, so we need to work towards finding alternate paths rather than discard Wikiwix (which I'm afraid will happen should the script idea fail). Huntster (t @ c) 08:25, 19 February 2011 (UTC)
Yes, it is an issue. A bot cannot reliably determine what is and is not a reference. Even if it puts it on every link in ref tags, it's still going to miss some. Not to mention that a bot is basically the most inefficient solution possible short of doing the bot's task manually. Using more template parameters also has the possibility of increasing page load time on articles that already take 30+ seconds to parse because of all the citations. Currently only 14 people have commented on the RFC, perhaps we should get some more opinions before we intentionally make a bad decision. Mr.Z-man 13:00, 19 February 2011 (UTC)
I am not entirely happy with it but I have created User:Allen4names/Query web cache. It would require an edit to MediaWiki:common.css (or your own CSS file) to show which archives need not be queried. Still I find this preferable to the Wikiwix script. – Allen4names 04:09, 21 February 2011 (UTC)
Since page loading performance is being discussed, I should describe the performance of the script. The script itself is short, but the function it uses to add the archive link after each external link can take some time on long articles. Thus, at the French Wikipedia, the default script is restricted to the reference section in the main namespace, in order to prevent long loading times on lengthy articles. If users want the archive displayed with every external link, they can use the corresponding gadget.
At any rate, page loading seems less of an issue with the script than with the template overload. The script is executed once the page is already loaded and displayed. Template overload burdens the servers, which drastically affects the time the server needs to prepare an uncached page before it is sent to the user. Yours, Dodoïste (talk) 20:26, 23 February 2011 (UTC)

I love it. I absolutely love it. I don't have javascript (long story) so the script based solutions won't help me. But this will. Thank you so much. - Hydroxonium (H3O+) 05:51, 21 February 2011 (UTC)

You're welcome. I am going to move it to Template:Query web archive. – Allen4names 17:18, 21 February 2011 (UTC)
Just tried a little test and a big test and it works great! I found a link for a previously dead webpage using Internet Archive's new Wayback beta. It also seems that just running a query at Wikiwix will automatically archive a page in many cases. This is wonderful. Thanks very much - Hydroxonium (H3O+) 00:08, 22 February 2011 (UTC)

Original WebCiteBOT is back

Or at least it will be unless people object.

I have just returned to Wikipedia and getting the bot back up and running will be my first priority unless I am told it is no longer needed/wanted.

Thanks, ThaddeusB (talk) 23:28, 22 February 2011 (UTC)

P.S. I am certainly in favor of there being multiple options to prevent and fix dead links, so this is in no way meant to discourage the alternatives in development. --ThaddeusB (talk) 23:30, 22 February 2011 (UTC)

Good to have you back Thaddeus. Huntster (t @ c) 00:23, 23 February 2011 (UTC)
Welcome back, ThaddeusB. Your bot is desperately needed and very much wanted and appreciated. We missed you dearly. This is GREAT news for all of us. Thanks very much for the help. - Hydroxonium (H3O+) 00:55, 23 February 2011 (UTC)
I agree, very good news. This bot is in fact much needed. Welcome back ThaddeusB. Toshio Yamaguchi (talk) 01:05, 23 February 2011 (UTC)
Welcome back! The bot is very much needed/wanted. I note from past discussions that its current code is proprietary, which begs the question of its price tag.   — Jeff G.  ツ 22:41, 23 February 2011 (UTC)
I didn't release the code for reuse for personal reasons (might theoretically reuse some code for commercial purposes in the future), but there certainly is no price tag associated with it. :) --ThaddeusB (talk) 16:16, 25 February 2011 (UTC)
I've also got a program that I'm about to kick into high gear, working with WebCitation, which should also make this partnership with Wikiwix pointless. It should be up and fully operational within a week. ΔT The only constant 01:53, 25 February 2011 (UTC)
Where might one find its BRFA?   — Jeff G.  ツ 02:35, 1 March 2011 (UTC)
It is linked from bot's user page. —  HELLKNOWZ  ▎TALK 08:32, 1 March 2011 (UTC)
I think Jeff G. might have been asking about a BRFA for Δ's program that he is "about to kick into high gear". Anomie 15:24, 1 March 2011 (UTC)
I was indeed.   — Jeff G.  ツ 03:12, 23 March 2011 (UTC)

I'm planning a small test run tomorrow. I will post a link to the results when I have some. --ThaddeusB (talk) 16:16, 25 February 2011 (UTC)

And... ? emijrp (talk) 00:46, 21 April 2011 (UTC)

New RfC for Wikiwix

SJ has suggested starting a new RfC for Wikiwix with a small scale test of one category for one month so that users can see how it will work (see here). A couple of people have agreed and I would like to close the current RfC early. I would like to get input from everybody before I start on this.

  1. Does anybody object to closing the current Wikiwix RfC early?
  2. What would people like to see in the next Wikiwix RfC?

Any input is greatly appreciated. - Hydroxonium (H3O+) 18:12, 24 February 2011 (UTC)

Discussion for the new RfC has moved here. - Hydroxonium (H3O+) 00:46, 4 March 2011 (UTC)

Draft RfC

I started drafting a new RfC several times and decided it's better to have the community write it. So I have essentially a blank page here and would like to get others' help in writing the new RfC. Everybody is encouraged to edit the page. Thanks. - Hydroxonium (TCV) 10:20, 16 March 2011 (UTC)

A new WebCiteBOT

Hi all. I'm working on a new WebCiteBOT. I have opened a request for approval. It is free software and written in Python. I hope we can work together on this. Archiving regards. emijrp (talk) 17:15, 21 April 2011 (UTC)

Awesome, sounds good. I have some questions about it though. It says it fills in the cite web parameter in Template:Cite web. Does that mean it adds the WebCite url as the archiveurl parameter? Also, which citation formats does it support? Toshio Yamaguchi (talk) 17:39, 21 April 2011 (UTC)
Yes, it adds the url to archiveurl=. For now, the bot only works with bare URL references, but I want to add support for {{cite web}} citations too. See the conversion table at the RFA and suggest new conversions. Regards. emijrp (talk) 18:38, 21 April 2011 (UTC)

RfC to add dead url parameter for citations

A relevant RfC is in progress at Wikipedia:Requests for comment/Dead url parameter for citations. Your comments are welcome, thanks! —  HELLKNOWZ  ▎TALK 10:49, 21 May 2011 (UTC)

Good news

After a very long absence, I am back and have restarted the original WebCiteBOT. The code required several tweaks, but not a major rewrite or anything. Initial tests are now underway to make sure no further tweaks are needed. Interested parties can track the bot's Logs or contributions. Feedback is, of course, welcome.

I will post another update here when the bot is running full time. More frequent updates may be available on the bot's talk page. --ThaddeusB (talk) 01:20, 18 February 2012 (UTC)

Good to hear. Welcome back ThaddeusB. -- œ 08:31, 22 February 2012 (UTC)
Indeed, welcome back. I hope you can stick around, as this is a critical capability for the project. LeadSongDog come howl! 18:36, 22 February 2012 (UTC)

Does this still need doing?

Are there any bots running now?

--Tim1357 talk 17:32, 12 November 2012 (UTC)

Well, there's an open BRFA here, but it seems to be facing the same roadblock that closed your bot request. Σσς(Sigma) 20:35, 12 November 2012 (UTC)
In case anyone still monitors this page, I just wanted to say the original WebCiteBOT is back up and running again (just tests at the moment). --ThaddeusB (talk) 05:46, 20 June 2013 (UTC)