Wikipedia:Requests for comment/Archive.is RFC

Recent events related to archive.is have left Wikipedia's links to that service in a state that requires a community decision.

Background

"Archive.is" is a website that functions similarly to the more established Wayback Machine: both provide an archiving service that saves snapshots of web pages from across the internet in a large repository. If the original pages become unavailable, or their content is removed or changed, these services provide a static backup of each page that can be linked to with presumably greater assurance that the content will remain online and intact. Archive.is is a newer service that competes with the much older Wayback Machine. Wikipedia articles commonly link to the Wayback Machine's copies of web pages in their references in order to combat link rot.

A bot called RotlinkBot, created by User:Rotlink, has recently begun linking Wikipedia articles to the new Archive.is service. This bot was not approved, and was therefore subsequently blocked.

Following this block, the bot was used in an anonymous operation using IPs from three different Indian states, Italy, Hong Kong, Vietnam, Bulgaria, Qatar, Latvia, Hungary, Slovakia, Romania, Brazil, Argentina, Portugal, Spain, France, Mexico, Austria, and South Africa, raising strong suspicions that the IPs were not being used legally. These IPs, along with User:Rotlink (who has self-identified as the owner of archive.is), were subsequently blocked. Rotlink has not commented on any of the blocks.

Over 10,000 links to archive.is remain on Wikipedia.

Points to consider

  1. Archive.is is a relatively young archiving service.
  2. No one has found any problems with the quality of archived links. So far as anyone can determine, archive.is is presenting an accurate record of all material it claims to archive.
  3. In this discussion, User:Rotlink identifies himself as the owner of archive.is.
  4. Rotlink wrote User:RotlinkBot, a bot which created links. It was unapproved, and blocked because of unapproved operation. Again, the bot seemed to operate reasonably well: minor defects were noted, but nothing serious. The motivation for the block was the unapproved operation.
  5. RotlinkBot did not exclusively add links to archive.is: it added links to other archiving sites as well, and apparently in preference to archive.is in some cases.
  6. On September 3, 2013, 94.155.181.118 began inserting links to archive.is, as well as links to other archive sites. This appears to be RotlinkBot running anonymously.
  7. By September 17, 2013, the list of IPs that were inserting these links had grown. It included at least the following:
  8. This list of IPs included three different Indian states, Italy, Hong Kong, Vietnam, Bulgaria, Qatar, Latvia, Hungary, Slovakia, Romania, Brazil, Argentina, Portugal, Spain, France, Mexico, Austria, and South Africa.
  9. Based on that pattern of IPs, User:Kww concluded that not only was RotlinkBot being used anonymously in violation of its block, but that the IPs being used were likely to be anonymous proxies or a similar form of botnet. He blocked Rotlink, all of the IPs, and a few more IPs that were discovered later.
  10. He called for edits by the IPs to be rolled back at WP:ANI: https://en.wikipedia.org/w/index.php?title=Wikipedia:Administrators%27_noticeboard/Incidents&oldid=573791554#Mass_rollbacks_required
  11. Many editors and admins reverted.
  12. At this point, over 10,000 links to archive.is remain in Wikipedia.
  13. At this point, User:Kww has no firm proof of illegal activity, although he remains of the opinion that this is likely.
  14. User:Rotlink has made no comment regarding his block.

The current situation is awkward. It's impractical to place the link on the spam blacklist, because the spam blacklist will interfere with editing any of the articles that contain a link to archive.is. It seems strange to have so many links, but to claim that no more links can be added. Several editors view the rollbacks themselves as destructive. We need to figure out how to go forward.

There would appear to be several options.

Options

1. Remain where we are

Wikipedia is notoriously inconsistent, and this is just one more case. There's no need for blacklisting, no need to remove the existing links, and no need to restore links that were removed due to the improper bot use.

  1. Support Per my comment in the discussion section below. I put a lot of work into changing over broken links to archive.is links when pinkpaper.com went offline. I don't see why my hard work should be undone because of someone else's misbehaviour. If someone spammed links to BBC News all over Wikipedia, and we found it was a BBC News employee, that wouldn't change the fact that BBC News is a useful and reliable source. Behaviour issues on the part of this unapproved bot operator don't change the fact that archive.is remains a useful service that ensures a fair number of references are actually verifiable. —Tom Morris (talk) 12:52, 21 September 2013 (UTC)
  2. Support I added many archive.is links to snooker-related articles, because they can't be found on any other archiving service due to nobot restrictions. Armbrust The Homunculus 16:45, 21 September 2013 (UTC)

2. Revert the reversions

Since no one has found a problem with the existing links that were reverted, the reverted links should be restored.

  1. I'm going to park myself here, although I do think those added after Rotlink's indef by IPs shouldn't be reinstated. Kww seems to be going around in circles on whether they want the site blacklisted or not, and is jumping to conclusions that simply have no foundation whatsoever. Lukeno94 (tell Luke off here) 12:51, 21 September 2013 (UTC)

3. Complete removal of archive.is

We should write a bot which searches for all links to archive.is, replacing them when possible, and removing them when not. When this bot is complete, archive.is should be placed on the blacklist.
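
As a rough illustration of the bot pass this option describes, the sketch below scans wikitext for archive.is URLs, swaps in a replacement when one is known, and removes the URL otherwise. The regular expression, the function name `rewrite_archive_links`, and the `replacements` mapping are assumptions made for illustration and do not describe any existing bot.

```python
import re

# Matches bare archive.is URLs in wikitext. The pattern is an illustrative
# assumption: it stops at whitespace and at common wikitext delimiters.
ARCHIVE_IS_URL = re.compile(r"https?://(?:www\.)?archive\.is/[^\s\]|}<]+")


def rewrite_archive_links(wikitext, replacements):
    """Replace each archive.is URL with a known replacement, or remove it.

    `replacements` is a hypothetical mapping from archive.is URLs to
    equivalent snapshots held by another archiving service.
    Returns the rewritten text plus the lists of replaced and removed URLs.
    """
    replaced, removed = [], []

    def substitute(match):
        url = match.group(0)
        if url in replacements:
            replaced.append(url)
            return replacements[url]
        removed.append(url)
        return ""  # no replacement known: drop the URL

    new_text = ARCHIVE_IS_URL.sub(substitute, wikitext)
    return new_text, replaced, removed
```

A real bot would also have to tidy the citation markup left behind when a URL is removed, and it would need to go through the usual bot approval process before running.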

  1. I prefer this option. It is based primarily on my belief that the IPs were not being used legally. This makes me distrust the motives of archive.is, and suspicious that we are being set up as the victim of a Trojan Horse: once the links to archive.is are established, those links can be rerouted to anywhere. If illegal means were used to create the links, why should we trust the links to remain safe?—Kww(talk) 15:57, 20 September 2013 (UTC)
  2. Support this option as second choice.--v/r - TP 21:01, 20 September 2013 (UTC)
  3. Support with no prejudice to human readdition - do not blacklist the link. A bot cannot determine whether the link is appropriate, but if a human editor does, he should be free to add it. ~Charmlet -talk- 21:45, 20 September 2013 (UTC)
  4. Support. This is a new and uncertain operation, and there are serious questions about its ethics and stability given what has happened. When the operation has been around long enough to show it is trustworthy, we can reconsider using it then. But as it stands, we should remove it completely and flag it as questionable so as to save editors from working on creating links to it, which later either break down if the operation closes, or lead to adverts as the owner indicates might happen. The Wikipedia article on the operation itself is currently at AfD, with six delete comments and one keep: Wikipedia:Articles for deletion/Archive.is. The company may have been using Wikipedia to make themselves known, and to pave the way for the site owner to make a profit. It is not our purpose to promote or advertise any company. Alexa shows that Wikipedia is the website's fourth largest direct supplier, and an indirect supplier via mirrors and Google searches. The website is a start up that is relying on Wikipedia to build traffic. The owner has indicated that ads may appear after 2014. We should wait until the operation has proved itself before setting up thousands of links to what may become an advert site. SilkTork ✔Tea time 09:30, 22 September 2013 (UTC)
  5. Support per SilkTork (moved from option 4). And after reading the FAQ, it seems apparent that this is a one-man operation. Combined with the possibility of ads in the future, and the evidence we have on his ethics (which make me doubt this will even be a viable ad-free service for as long as promised), we should clean this up while it's still somewhat manageable and wait to see what archive.is becomes before allowing our articles to become reliant on it (as otherwise we could end up with an even tougher problem to deal with). equazcion (talk) 09:48, 22 Sep 2013 (UTC)
    According to a response on the website there are two people running the operation. So it was either the owner who has been inappropriately using Wikipedia or the owner's partner. Either way, not a good show. SilkTork ✔Tea time 09:58, 22 September 2013 (UTC)
  6. Support for the time being. Although I've encouraged Rotlink to follow procedure at every step and promptly addressed his BRFA, it is clear that the likelihood of an ulterior motive is very high, as he has circumvented our processes at every step once he realized they will take time. And there is absolutely no reason to be in such a hurry to add massive numbers of links to one's website unless one really wants to drive traffic to their website. I know this is speculation, and I'd like to be proven otherwise. But while this is a 1-man operation, has no financial safety proof, doesn't follow robots.txt, the owner is this impatient to add links and doesn't respond, uses anonymous proxies to further add links, and there are no guarantees that the website doesn't suddenly start serving ads, I cannot endorse this archival service. Per SilkTork, the ethics and stability are too uncertain. This service first has to prove it is well-meant, reliable, and open -- two of which are already under significant doubt. For example, we have Webcite as a perfect alternative and every link rightfully archived at archive.is could have been archived at Webcite. —  HELLKNOWZ  ▎TALK 13:58, 22 September 2013 (UTC)
    If you are going to make the comparison with WebCite, there are more points to consider.
    • Supporting robots.txt, which was designed for crawlers, is not relevant to on-demand archives. It prevents them from archiving pages from many sites.
    • It is WebCite that is a 1-man enterprise experiencing financial problems, which could only escalate after moving to expensive Amazon EC2 cloud hosting[1]. 77.110.134.11 (talk) 14:53, 22 September 2013 (UTC)
  7. Oppose. I think the arguments of the supporters are very emotional. They appeal to ethics and try to predict the future. My vision of the future is:
  • The reversion will be mass-scale vandalism.
  • The editors will scream like User:Lexein. Most of them do not read ANI and RFC and do not take part in this discussion. But they get notified about the changes in their articles.
  • Many sources are available only on archive.is (see ANI discussion for examples). The editors will have to circumvent the ban of the domain. Do you know how they do it for currently banned domains?
  • Assuming that the bot was seeking traffic, it can also circumvent the domain ban using Google Cache or WebCite. Both keep JavaScript on archived pages and the script can redirect traffic anywhere. I would say it is even easier to steal traffic this way. 193.86.243.17 (talk) 07:49, 23 September 2013 (UTC)
"Vandalism" is a deliberate attempt to compromise Wikipedia. A mass revert in good faith is not vandalism, since the aim is to improve Wikipedia, regardless of whether the result does so.
Circumventing Wikipedia policy is pointy as well as being against policy. If some editors do that, we can deal with it as it becomes a problem, but I don't think we should make up hypothetical problems to stop us doing things that are a good idea.
me_and 16:51, 23 September 2013 (UTC)

It is not traditionally the business of an encyclopaedia to help readers to obtain out-of-print references, or their modern equivalents, as far as I know. Hypothetically, various third-party apps and third-party websites could choose to shoulder the legal risks, if any, of presenting modified versions of Wikipedia articles with archive links added. Wikipedia's content licensing allows this. Editors would be free from legal risks and would have more free time to add actual content to the encyclopaedia.--greenrd (talk) 19:39, 23 September 2013 (UTC)

  1. Support as proposer primarily on the grounds of freeing up editor time. Automation is good - even if someone else is doing it.--greenrd (talk) 19:39, 23 September 2013 (UTC)

We should replace links to archive.is that were added by the bot, where possible. Where no replacement is available, the links should be left in place. Links added by human editors should be left in place as well.

The circumstances surrounding these links leave me uneasy about leaving them alone (a startup trying to establish itself by automatically spreading its links across Wikipedia, use of proxies, unapproved bot by unresponsive entrepreneur). However I'm wary of cutting off our nose to spite our face -- if they have the only viable links to the content we need for a substantial number of references, leave the links alone in those cases. But in situations where there is a replacement available at a different, reputable service, those links should be switched over. Links added by people should also be left alone, however, as editors should be allowed to link to whichever service they want. The pervasiveness of the bot-added links establishes a possible artificial trust among editors who see them that I think warrants undoing. equazcion (talk) 16:03, 21 Sep 2013 (UTC) (moved to complete removal)
  1. This is a sensible option; the easiest way (albeit not a foolproof way) is to simply nuke those added by IPs. Lukeno94 (tell Luke off here) 18:47, 21 September 2013 (UTC)
  2. Support, as long as archive.is remains ad-free and there remains no evidence the archives are not faithful renditions of the original sites. NE Ent
  3. Support: I don't see enough evidence to support reverting legitimate editors' work, but I also believe that we should stop unauthorized bots from being able to edit Wikipedia even when their edits are ostensibly positive. —me_and 09:03, 23 September 2013 (UTC)

Contact Rotlink off-wiki (using email perhaps) and encourage them to follow the community's process for bot approval so the bot can operate within policy.

  1. Support as proposer and first choice.--v/r - TP 21:03, 20 September 2013 (UTC)
  2. Support in addition to number 3. I don't support a bot for what I believe should be human judgement, but regardless, I think he should be allowed to use the community processes for approval if the community so wishes. ~Charmlet -talk- 21:45, 20 September 2013 (UTC)

6. Copy Archive.is and WebCite content to a Wikimedia-controlled server before it is too late

It is only 10 TB (Archive.is) and 2 TB (WebCite). $500 question (3 * $165 (4 TB HDD)). 193.86.243.17 (talk) 07:49, 23 September 2013 (UTC)

x3 for redundancy. Not that that makes it exorbitant or anything.--v/r - TP 13:29, 23 September 2013 (UTC)
It's never that cheap or that easy. Setting up and maintaining such a service requires more than simply the disk space. In any case, there are discussions about doing this for WebCite at meta:WebCite. —me_and 16:59, 23 September 2013 (UTC)
Do not host the archive files. Only copy and keep. If WebCite goes down, give the files to archive.is or archive.org and ask them to host them. If archive.is goes down, give the files to WebCite or archive.org. 88.15.83.61 (talk) 19:42, 23 September 2013 (UTC)
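
The disk-cost figures quoted in this thread can be checked with a few lines of arithmetic. The sketch below assumes only the numbers given above (10 TB for Archive.is, 2 TB for WebCite, 4 TB drives at $165 each) and counts drive purchase price alone, which, as noted in the replies, is not the whole cost of running such a service. The function name `drive_cost` is hypothetical.

```python
import math


def drive_cost(total_tb, drive_tb=4, price_per_drive=165, copies=1):
    """Return (number of drives, total price) needed to hold `copies`
    full copies of `total_tb` terabytes on drives of `drive_tb` each."""
    drives = math.ceil(total_tb * copies / drive_tb)
    return drives, drives * price_per_drive


total_tb = 10 + 2  # Archive.is + WebCite, figures as quoted in the thread

single = drive_cost(total_tb)               # one copy: 3 drives, $495
redundant = drive_cost(total_tb, copies=3)  # three copies: 9 drives, $1485
```

One copy comes to the "$500 question" from the proposal, and tripling it for redundancy gives roughly $1,500 in drives.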

Discussion

  • I'm very concerned about the idea of completely removing all archive.is links, even those added by actual editors. I know of several editors who switched from WebCite to Archive.is when WebCite's future ability to archive came into question, myself being one of them. WebCite does say existing archives won't go away, but I find it hard to trust this in the long term. I'm unaware of any on-demand services other than WebCite and Archive.is at this point, so if in a year WebCite goes away and Archive.is is no longer trusted, where does this leave us? To be blunt, we're probably back to the idea of either trying to take over WebCite ourselves, or providing some funding, or...something along those lines. I know copyright/non-free content is an issue for the Foundation, but a solution needs to be found for the long run, not simply what's convenient right now. (Sorry for rambling...) Huntster (t @ c) 07:58, 21 September 2013 (UTC)
  • I know about Archive.org, but it is not "on demand". My whole point was that without the two above-mentioned sites, we won't have access to on demand services, which are needed for archiving a specific instance of a site. Huntster (t @ c) 20:54, 21 September 2013 (UTC)
  • I think the "remove all archive.is links" option is incredibly stupid. By all means, remove all links added by IPs if you really have to do this, but really... Fuck knows why you proposed this; you may as well blacklist it if you want this solution! Also, the option I want isn't there: where we reinstate all of the archive.is links added before Rotlink's indefinite block (or those inserted before their bot was indeffed), and only remove those added by IPs after their indef. Lukeno94 (tell Luke off here) 10:29, 21 September 2013 (UTC)
  • Please reread my comment: I explained myself. I believe the owner of the site to have engaged in illegal activity, and therefore do not trust him, his site, or his future intentions. The bot was indeffed on August 18, so all links created by the IPs above were placed in defiance of a block.—Kww(talk) 16:47, 21 September 2013 (UTC)
  • Again, you have made that claim, but there's no real evidence to prove it; whatever happened to "innocent until proven guilty" anyway? You may not trust the site, but it is clear that several longstanding editors - including myself - do, and still trust it. Your proposal to nuke everything flies in the face of a LOT of work by legitimate editors, particularly as archive.is has often been the only accessible archive for a given page. Lukeno94 (tell Luke off here) 18:45, 21 September 2013 (UTC)
  • Considering I know absolutely nothing about how VPNs and proxies work, I don't know what is legitimate and what isn't. But the actions of one person, regardless of who they are, shouldn't result in lots of other people having their hard work undone (as it can be VERY hard to find a working archive for a link sometimes...) Lukeno94 (tell Luke off here) 20:04, 21 September 2013 (UTC)
  • Kww, I see you presume it should be obvious to anyone that it was done illegally. I'm reasonably familiar with how proxies work and I'm not sure I understand your reasoning. If I wanted to set up proxies in several different countries I'm fairly certain I could do it legally. Could you explain what you believe to have occurred here that was illegal, and what leads you to think that? I'm asking honestly, not necessarily out of doubt. You may be more knowledgeable in these things than I am. equazcion (talk) 20:17, 21 Sep 2013 (UTC)
  • First is Occam's razor: what would prompt anyone to actually go to the expense of negotiating individual proxy hosts in places ranging from Qatar to Brazil to Vietnam? Second is the nature of the IPs: they aren't webhosts and servers. Instead, they are individual IPs on ADSL networks, FTTH networks, cable modems, etc. Everything about the setup screams "botnet". If it was a legitimate proxy arrangement, I would expect to see webhosts and servers hosted in a small number of countries with good internet access.—Kww(talk) 00:27, 22 September 2013 (UTC)
  • I think known open and advertised proxies tend to be preemptively blocked from editing. equazcion (talk) 09:18, 22 Sep 2013 (UTC)
  • This can explain why there are so few webhosts and contiguous blocks of IPs. They were already blocked from editing. I can imagine another simple way to get proxies. We are talking about a site owner, right? Then he/she can see the access logs of the site. There are usually a lot of hits from malicious security scanners (looking for SQL injections, etc). Those IPs are proxies and can be connected back to and reused. Setting up one's own proxy infrastructure looks too expensive. 77.111.172.172 (talk) 09:53, 22 September 2013 (UTC)
  • I'm very concerned about the removal of archive.is links too. A while back, I tried to fix all the references that use the now offline site pinkpaper.com, the website of the Pink Paper, one of the UK's main LGBT news sources. Between archive.org and archive.is, I managed to find replacements for some but not all of the references used. The LGBT topic area tends to be filled with a lot of poorly sourced material especially around BLP subjects. Removing archive.is links is likely to leave a lot of those links broken. I don't really know what's going on with the IP and the non-approved bot account, but I'd rather not have all the hard work I put into fixing PinkPaper links removed just because of somebody else's behaviour. And I'm not keen on having BLP articles on sexuality-related topics potentially left without sources. This seems self-defeating. Whatever the problem is, please can you seek a calmer, less dramatic solution than removing all the links to a useful archival service. —Tom Morris (talk) 12:48, 21 September 2013 (UTC)
  • Thought/idea: would it be possible to wrap the archive.is links up in an external links template? (similar to {{IMDb title}} etc) That way, if the site goes hinky in the future, all the links could quickly be disabled, minimizing negative fallout. Siawase (talk) 12:03, 22 September 2013 (UTC)
    • This is a good idea. Wouldn't it be better to back up its content to a Wikipedia server? If the site goes hinky in the future, all the links could be changed to something like archiveis.wikimedia.org instead of disabling the links and hitting the verifiability issue. 77.110.134.11 (talk) 12:15, 22 September 2013 (UTC)
    • My concern with this is that a template would be seen as tacit approval of these links, which I think we're a long way from having. I know I would see the use of such a template as implying the community considers these links to be A Good Thing, particularly if Archive.is had such a template while other archiving services didn't. —me_and 09:06, 23 September 2013 (UTC)
  • The introduction of this RfC misses the fact that User:Rotlink (user, not bot) himself added a lot of links [1] between having his bot blocked, withdrawing his BRFA and until this was pointed out to him [2]. —  HELLKNOWZ  ▎TALK 13:35, 22 September 2013 (UTC)
  • Do we even have any proof that Rotlink is the owner of the website, and isn't just claiming to be? I'm still disgusted that the actions of one person could lead to the reversion of a shedload of good edits by legitimate editors; regardless of whatever position they hold. Frankly, the age of an archiving site is utterly irrelevant; if it does go under, or if it does end up with adverts, then THAT is the time to propose its removal. Seems like several people have forgotten about WP:CRYSTAL... Lukeno94 (tell Luke off here) 14:37, 22 September 2013 (UTC)
  • At the risk of looking like a total prat, that isn't convincing. Lexein has communicated with Rotlink, who claims that they are the owner. Lexein's usage of words doesn't confirm or disprove the claim. I'd like to see something rather more solid before we jump to conclusions about whether to include archive.is links or not. Also, the presence of a Wikipedia article, and the reliability and/or notability of it, has precisely nothing to do with whether we use an archiving site or not; bringing that up is unnecessary and deliberately inflammatory. Lukeno94 (tell Luke off here) 19:27, 22 September 2013 (UTC)
  • better diff. —  HELLKNOWZ  ▎TALK 19:37, 22 September 2013 (UTC)
  • CRYSTAL applies to determining article existence and content. A modicum of speculation isn't unreasonable when it comes to technical concerns, which this can easily become if an archive site becomes widely relied upon by articles and then becomes unviable. equazcion (talk) 19:43, 22 Sep 2013 (UTC)
  • The possibility of WebCite going under was, and perhaps still is, very real. Did that mean we blanket removed every single link? No, it didn't. The diff Hellknowz linked shows a lot of technical knowledge, but it's fairly generic stuff that anyone who goes and looks things up could come out with. Given that it is near a year old though, it appears probable, if not certain in my mind, that Rotlink is an owner, or employee. It does not, however, confirm he is the only owner; nor should it matter one iota if a website is owned by one guy, two guys, or a consortium. Lukeno94 (tell Luke off here) 19:55, 22 September 2013 (UTC)
  • Luke, Lexein has emailed the owner and has later confirmed that the owner is Rotlink. I was not trying to "bring up" that article (you surely know about it already) and am disturbed that you think it "deliberately inflammatory" to try to help answer the question you asked, "Do we have any proof...", by referencing another editor's research. NebY (talk) 22:20, 22 September 2013 (UTC)
WebCite is no longer accepting submissions like they used to. I tried it today. They rejected my target page with the false summary claiming that my email address was incorrect. Archive.is accepted my submission no problem. Poeticbent talk 21:01, 22 September 2013 (UTC)
I just archived a page fine, perhaps your e-mail address was invalid, like a stray character. —  HELLKNOWZ  ▎TALK 23:02, 22 September 2013 (UTC)
Must've been a temporary outage. I tried it yesterday and got the same error message. Good to know though that it's back up again for the time being. De728631 (talk) 15:23, 23 September 2013 (UTC)