Help talk:Using archive.today

September 2013

One of the editors posted the following edit summary when removing the wording commenting on the copyright issues:

"Remove grossly inappropriate description, US law is not world law, so don't apply it here, not to mention this is seriously confusing what copyright really is, whether there is or isn't robots.txt is irrelevant and libraries have different laws."

First, no one has claimed US law is world law. All I am saying is that Wikipedia does not want (and is not legally able) to violate US copyright law, nor does it want to incur endless DMCA take-down requests, which will surely be the result if people start linking Wikipedia articles to unauthorized archived copies of copyrighted works. Pointing out that copyright laws are different in other countries is obviously irrelevant.

Second, I agree that the honoring of robots exclusions and the honoring of copyright laws are two different things. However, as the proposed wording explains, robot exclusion files are the only known means used by responsible web archives to avoid copyright infringement. If Archive.is has some other way of avoiding copyright infringement, that would be fine. But they don't. Archive.is contains a large amount of copyright-infringing material, which anyone can see for themselves. (See an example on the Wikipedia article on Archive.is, but you had better hurry, because there is a nomination for deletion of that article.) So, the fact that Archive.is refuses to honor robot exclusions for copyrighted material is closely related to the fact that they are violating copyright law.

Third, the editor says "libraries have different laws". I don't know what that is supposed to mean, but if anyone thinks it means that libraries or online archives are allowed to violate copyright law, they are mistaken.

Fourth, the editor says the proposed text is a "grossly inappropriate description", but the justification for this claim is based on the misunderstanding noted above. The proposed text is entirely appropriate. Wikipedia should not be a party to copyright infringement. Can we at least agree on this? Weakestletter (talk) 21:12, 23 September 2013 (UTC)[reply]

The text I removed is a misrepresentation, clearly written in a biased and non-neutral tone with a prejudice against Archive.is. This is a how-to guide, yet users are instead presented with this text in boldface, the first thing on the page, saying they shouldn't use it and making legal claims. If you cannot see how this is inappropriate, I'm afraid I will not be able to explain it.
I replaced the text with the most straightforward, unbiased version that makes no assumptions or claims: "Note: Archive.is does not follow robots exclusion standard and may archive content that the website owners have excluded from automated crawlers.". Everything else merely adds inflammatory language instead of being a helpful guide. And you cannot be making legal claims on Wikimedia's behalf.
For your first point, archive.is is not in the US. They have no obligation to follow the DMCA or any US law. And they are not breaking US/CA copyright laws, which are what Wikipedia uses. We link to thepiratebay.sx from The Pirate Bay. By your assertion, Wikipedia is breaking the law.
For your question about libraries, US library services (in the way archivers are classified) have different laws regarding copyright infringement, mainly that they do not infringe anything if they follow certain US rules (this is even in the link attached). Google is such a service. robots.txt is just one way of following those US rules. —  HELLKNOWZ  ▎TALK 22:10, 23 September 2013 (UTC)[reply]
Many US magazines and publishing houses (and I suppose they know much more about copyright than we do and take care about it) see no problem using archive.is and linking to it. I suggest removing the copyright alarm as lame. 88.15.83.61 (talk) 19:35, 24 September 2013 (UTC)[reply]

About the recent tag edits

The tag was added by an editor near the end of Wikipedia:Deletion review/Log/2013 October 28, followed by its reversion by me. --Lexein (talk) 14:40, 29 October 2013 (UTC)[reply]

No, it isn't needed at all. This passed MFD, period. The dispute was resolved and the how-to was kept. Nobody brought up WP:Using Archive.is during the WP:Archive.is RFC discussion period. It wasn't considered relevant, and it isn't. Tag removed. Please don't re-add it. If you're dead set against deletion, start another MFD. People do it all the time, with rarely changed results. --Lexein (talk) 23:35, 29 October 2013 (UTC)[reply]
Taken to Wikipedia:Administrators'_noticeboard/Incidents#De-linking_of_Wikipedia:Using_Archive.is_a_challenged_How-To_to_its_RfC. --SmokeyJoe (talk) 00:17, 30 October 2013 (UTC)[reply]

How do I properly link to http://archi ve.is/jPlGB (added space) in a reference for an article? (It *was* http://kappapiart.org/join.html) Naraht (talk) 17:48, 5 May 2016 (UTC)[reply]

@Naraht: archive.li/jPlGB may work but it looks like the content you wanted to reference has moved to a new address: http://kappapiart.com/new-members. —LLarson (said & done) 18:52, 5 May 2016 (UTC)[reply]
@LLarson: Thanks for *both* explaining how to use the site (with a different country domain) *and* finding the new national page. Odd that it doesn't show up on the first page of Google. Naraht (talk) 18:59, 5 May 2016 (UTC)[reply]

RfC: Should we use short or long format URLs?

This RfC is to gauge community consensus about the preferred URL format for archive.is and WebCite when used in citations.

Both sites permit two URL formats, a shortened version and a longer version. Examples:

  • archive.is
  • WebCite
(pretend they go to the same link)
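For illustration only (the identifiers below are invented, not real snapshots, and the exact timestamp and date formats may differ), the two shapes look roughly like this:

    archive.is  short: https://archive.is/AbCdE
    archive.is  long:  https://archive.is/2016.07.05-120000/http://example.com/page
    WebCite     short: http://www.webcitation.org/6AbCdEfGh
    WebCite     long:  http://www.webcitation.org/query?url=http%3A%2F%2Fexample.com%2Fpage&date=2016-07-05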

Which one is preferred, or are both equally appropriate?

Related information:

  • The document Using WebCite says "Either is appropriate for use within Wikipedia," while the document Using archive.is says "The [longer] is preferred for use within Wikipedia because it preserves the source URI."
  • During the archive.is RfC, some users brought up concerns that short URLs can hide spam links, noting that URL-shortening services such as bit.ly have been blacklisted from Wikipedia.
  • Reverse engineering a shortened link to find the original URI can be done using the WebCite API, or by web scraping in archive.is's case (a short sketch follows after this list).
  • WebCite and archive.is default to the shortened version when creating a new link or in page display.
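As a rough illustration of the reverse-engineering point above, here is a minimal Python sketch. It is not an existing tool; the long-form regex is an assumption about what the snapshot page's HTML contains and may need adjusting.

    import re
    import requests

    # Assumed shape of archive.is long-form URLs: a timestamp followed by the
    # original URL. The snapshot page is expected to contain it in its HTML.
    LONG_FORM = re.compile(r'https?://archive\.is/\d{4}\.\d{2}\.\d{2}-\d{6}/[^\s"<>]+')

    def resolve_short(short_url):
        """Fetch an archive.is short URL and try to recover the long form."""
        html = requests.get(short_url, timeout=30).text
        match = LONG_FORM.search(html)
        return match.group(0) if match else None

    # Example (the identifier is made up):
    # print(resolve_short("https://archive.is/AbCdE"))

For WebCite, the same lookup could in principle go through its query API rather than scraping.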

Please leave a !vote below, such as short or long or either. -- GreenC 21:50, 5 July 2016 (UTC)[reply]

Discussion

This shouldn't be an RfC — we always use long. Link-shorteners are not allowed. Carl Fredrik 💌 📧 23:48, 5 July 2016 (UTC)[reply]

Following the discussion below, let's go with long and forbid short. And I don't mean block short, just automate conversion to long links. Carl Fredrik 💌 📧 08:58, 6 July 2016 (UTC)[reply]

Is it link-shorteners that are forbidden or is it redirection sites? What I've been able to find is Wikipedia:External links#Redirection sites. If there isn't an explicit existing guideline then it makes sense to show consensus for one for similar reasons as for redirection sites. PaleAqua (talk) 05:20, 6 July 2016 (UTC)[reply]

Can/should a bot be set up to automatically convert short links to the long form? Especially given the default link returned by the archive services is a shortened version. PaleAqua (talk) 03:50, 6 July 2016 (UTC)[reply]

Could a bot do it? Yes in most cases. But without community consensus (RfC) for using the long form URL, it could have trouble in the bot approval process or during the bot run should someone raise an objection. -- GreenC 04:22, 6 July 2016 (UTC)[reply]
Figured that. I prefer long, but think that, rather than forbidding short links, they should be converted to long links, either manually or via bot. Archive links should be matchable to the dead URL, which is harder to do with shortened links. I'd also like to see the URLs to archive.is unified if possible, as during the ban it seems aliases were used to bypass the filters. ( I supported unbanning, but still think it's important to be aware of the prevalence of the links etc. ) PaleAqua (talk) 05:20, 6 July 2016 (UTC)[reply]
  • Prefer long, but don't forbid short - archive.is gives you the short version of the link by default, but the long version contains the source URL - so running a bot to convert them to the long version strikes me as the ideal solution - David Gerard (talk) 08:08, 6 July 2016 (UTC)[reply]
  • (edit conflict) Prefer long, but don't forbid short – per above. Though short URLs are nice, it's probably best to use long URLs given concerns such as the possibility of bypassing the spam blacklist. I would support having a bot convert them to long URLs rather than forbidding short URLs, and unifying all archive.is domain links to "https://archive.is/" (while you're at it, change HTTP to HTTPS as well) is a good idea as well. Note that "preserving the source URI" is not a problem if editors use the {{cite web}} template correctly by placing the original URL in the |url= parameter and the archived version in the |archiveurl= parameter. Also, note that WebCite uses a query parameter for the original URL in the long form, so it may still be possible to bypass the blacklist. Perhaps the easiest would be to have a bot also check those, perform URL decoding and matching against the blacklist, and report if it finds a blacklisted link (a sketch of that check appears at the end of this discussion). nyuszika7h (talk) 08:10, 6 July 2016 (UTC)[reply]
Note: Please see my comment below, I have now realized having a bot always clean up after new link additions is not a good idea. nyuszika7h (talk) 13:45, 10 July 2016 (UTC)[reply]
@Nyuszika7H: Good idea about matching against the blacklist. There are also bare links that don't use cite templates but could use a template like {{wayback}}. Do you know of other domains for archive.is and webcitation.org? — Preceding unsigned comment added by Green Cardamom (talkcontribs) 14:01, 6 July 2016 (UTC)[reply]
@Green Cardamom: As far as I know, WebCite doesn't have any alternative domains. Archive.is has archive.is, archive.today, archive.li, archive.ec and archive.fo. nyuszika7h (talk) 14:09, 6 July 2016 (UTC)[reply]
@Green Cardamom: Freezepage also has a short form (example http://www.freezepage.com/1465141865DVZXYCBROO). 93.185.30.40 (talk) 15:33, 6 July 2016 (UTC)[reply]
  • Long. Let's take one concern off the minds of the people who oppose this service: obfuscation by spammers. —Best regards, Codename Lisa (talk) 08:52, 6 July 2016 (UTC)[reply]
  • Long. If there is any possibility of spammer abuse with the short links, then the short form will be added to the blacklist anyway. ~Amatulić (talk) 13:42, 6 July 2016 (UTC)[reply]
  • Long for bare links, short for use in templates like {{cite web}}. The templates already preserve the original URL and the capture date in other parameters, so a long URL would be an ugly duplicate, and we would need another RfC about how to deal with that duplication, whether by introducing special templates for the archive links, by developing {{cite web}} to calculate archive links by itself, or in some other way. 93.185.30.40 (talk) 15:28, 6 July 2016 (UTC)[reply]
Wouldn't the spam problem remain with short URLs in templates? Theoretically a bot could do the verification, constantly checking a link over and over again to make sure it matches the url argument, but that's a lot of network resources and bot maintenance. Ideally a bot converts from short to long one time; that way editors can visually see whether the URL is legitimate, and other bots can check for blacklisted links. -- GreenC 16:18, 8 July 2016 (UTC)[reply]
  • Better to set up a bot to change short links to long as soon as they are inserted. It is not a job for a human to read such a warning and then concatenate strings. 93.185.30.244 (talk) 11:43, 10 July 2016 (UTC)[reply]
  • Nope, it is not better: the bot would not be able to save the converted links (as they were blacklisted), and then someone has to come along later and a) remove the 'redirect', and b) find the original editor who inserted the evading link (and take action where needed). It is better to prevent and educate (which in itself reduces the number of edits, avoids having a bot make a rather cosmetic edit, avoids mistakes the bot may make in the conversion, and avoids all of these needing to be double-checked by yet another human). --Dirk Beetstra T C 12:46, 10 July 2016 (UTC)[reply]
  • If the bot notices it's blacklisted, it could place a tag on the article, as is already done for existing links found directly on the blacklist. Though I suppose an edit filter which tells users how to obtain the link could work (for WebCite it's straightforward; for archive.is the user needs to click the "share" link first); that might be simpler than having a bot do the cleanup, now that I think about it. But the edit filter shouldn't block those edits completely, at least initially, because articles currently containing short URLs need to be cleaned up – ideally by a bot. nyuszika7h (talk) 13:01, 10 July 2016 (UTC)[reply]
    • I was mainly suggesting this for the newly added links. What is there now should indeed be changed by a bot first, along with some cleanup of the few that might be there linking to otherwise blacklisted material. --Dirk Beetstra T C 13:19, 10 July 2016 (UTC)[reply]
  • If we go with long-only, I agree an edit filter for new link additions is a good idea, along with a one-time bot to clean up old cases, which is significantly more likely to get done than a daemon bot, which has added complexity and maintenance. -- GreenC 15:13, 10 July 2016 (UTC)[reply]
  • I don't like the idea of blocking short links. The entire reason why we allowed archive.is is to make it easier to edit; adding more hoops to jump through is a negative. I suggest a continuous bot that always runs and checks whether the short links are being used — and automatically replaces them. What we need is for editing to be easy — forcing people to jump through hoops will only cause them to think it's too much of a hassle and abandon their edits. Carl Fredrik 💌 📧 11:15, 13 July 2016 (UTC)[reply]
  • @CFCF: As I state below, the situation is not much different from the current situation with URL-shorteners - people there do have to go back and repair their link (or may choose to abandon their edits). But I would at first just suggest a 'warning-only' edit filter (obviously not with a warning, but with a good-faith remark), and have a bot lengthen all the short links that are there, and those that are still being added anyway. In that way it all gets tagged (and with a bit of luck many people will lengthen their links, avoiding the extra bot edit), and it is easy to check what gets added anyway; if that indeed shows significant real abuse of the short links (real spammers don't care about warnings, bot reverts, blocks on IP addresses, etc.), then we can always reconsider the situation (though I don't expect that it will become that bad - and there may be intermediate solutions as well). Do note that a template solution would remove the need for the user to add the link at all - the original URL is in one parameter, the archive URL is in another parameter - one could choose to have the template construct the archive link (which would work for most) from the original link and an archive-date parameter (a rough sketch of that idea appears at the end of this discussion). --Dirk Beetstra T C 11:47, 13 July 2016 (UTC)[reply]
  • Agreed, documentation of the long-form syntax is a simple one-time learning curve, as is use of templates for external links (e.g. {{wayback}}). These are the correct solutions already used by other archives like Wayback. Allowing short URLs against policy, then assuming a bot will fix intentional mistakes forever, is not a good idea for a bunch of reasons. Anyway, I certainly won't be writing a bot to do that. Maybe someone else will... we are talking hundreds of hours of labor and a personal commitment for unlimited years to come. These bots don't run themselves; they have constant page-formatting problems due to the irregular nature of wikisource data. It's not a simple regex find-and-replace: you have to deal with deadurl, archivedate, {{dead}}, soft-404s, 503s, 300s at archive.is, etc. It's complex and difficult to write and maintain. -- GreenC 13:18, 13 July 2016 (UTC)[reply]
  • Long only but no conduct enforcement against individuals who use short. The only remedy against short URLs should be either automatic change by a bot (strongly preferred) or change by the editor who finds them and objects to them, or both. Whether an individual editor includes short ones only one time or does it routinely, even after being informed that long ones are required, should not subject him/her to sanctions. This is a very minor rule applicable to only a couple of sources, and if we're going to have a rule, it needs to be drama-free. The only time conduct enforcement should be involved is if someone edit wars over it (and there absolutely shouldn't be a 3RR exception for changing or reverting short URLs) or the short URL is clearly being used for a nefarious or unencyclopedic purpose. Regards, TransporterMan (TALK) 18:19, 12 July 2016 (UTC)[reply]
  • Long, seconding TransporterMan. Long is better, and maybe have a template to mention it to people. Probably not even that. I don't care how many times someone "violates" this; it's a ridiculous thing to punish people for. It's that kind of rule creep that people complain about, because it discourages new editors. Fixing it is just standards maintenance. This is something a bot can handle with ease, so let's not waste people's time with it. Tamwin (talk) 06:48, 13 July 2016 (UTC)[reply]
    • @TransporterMan and Tamwin: Nobody suggested punishing people for it (unless of course they are using it to evade the blacklist). Note how Special:AbuseLog reads at the top: "Entries in this list may be constructive or made in good faith and are not necessarily an indication of wrongdoing on behalf of the user.". – nyuszika7h (talk) 08:21, 13 July 2016 (UTC)[reply]
    • @TransporterMan and Tamwin: I want to second this: except for deliberate disruption (e.g. intentional evasion of the blacklist) there will/should be no punishment. Even now, if you add a blacklisted link you are not punished, you are just blocked from saving the edit. No-one will/should go after you when you are, in good faith, blocked by the spam blacklist. That will/should not change. The problem with allowing the short links is that a) people may, in good faith, add blacklisted links which may need removal, b) people will use the short links to evade the blacklist (they do that with regular URL-shorteners on a regular basis), and c) every addition of a short link will need to be followed up with another edit by a bot (and maybe even checked to see whether the bot was always correct) - all of which can be avoided. The situation in the end will not be different from that of editors inadvertently using a URL-shortening service (or similar 'bad' links), something that happens regularly because some online services shorten certain links by default; those editors are not able to save their edits and have to go back and 'lengthen' their current URLs. --Dirk Beetstra T C 08:59, 13 July 2016 (UTC)[reply]
  • Long form, convert short, in the way User:CFCF suggests. Use automated conversion to the long form. The short form obfuscates the archived page, while the long form provides essential information about the archived data at a glance, namely the date and the original address. Users can insert either the long or the short form, but a bot or a script would automatically convert the short form into long (without any warning messages or similar cautionary measures). It should be mentioned on the appropriate WP pages that the long form is preferred. —Hexafluoride Ping me if you need help, or post on my talk 13:46, 14 July 2016 (UTC)[reply]
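Regarding the suggestion above about decoding the original URL out of WebCite long-form links and matching it against the blacklist, here is a minimal Python sketch of the idea. The helper names and the blacklist contents are hypothetical; a real bot would load the actual spam-blacklist patterns instead.

    from urllib.parse import urlparse, parse_qs

    # Hypothetical blacklist used only for illustration.
    BLACKLISTED_DOMAINS = {"spam.example"}

    def hidden_target(webcite_long_url):
        """Extract and URL-decode the original URL from the 'url' query parameter."""
        query = parse_qs(urlparse(webcite_long_url).query)
        values = query.get("url", [])
        return values[0] if values else None

    def is_blacklisted(original_url):
        """Check the original URL's host against the (illustrative) blacklist."""
        host = urlparse(original_url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in BLACKLISTED_DOMAINS)

    # Example (made-up link):
    # target = hidden_target("http://www.webcitation.org/query?url=http%3A%2F%2Fspam.example%2Fpage&date=2016-07-05")
    # print(target, is_blacklisted(target))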
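And on the template idea above (having a template or module construct the archive link from the original URL plus an archive-date parameter), here is a rough sketch of the string construction involved. The long-form pattern assumed here (timestamp followed by the original URL) is an assumption about archive.is's URL scheme, not a confirmed specification.

    def build_archive_is_url(original_url, timestamp):
        """Assemble an assumed archive.is long-form URL from its two parts."""
        # timestamp in the assumed form "2016.07.05-120000"
        return "https://archive.is/{}/{}".format(timestamp, original_url)

    # Example (made-up values):
    # build_archive_is_url("http://example.com/page", "2016.07.05-120000")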

archive.is in HTTPS

Moved from Talk:Archive.is

@93.185.30.56: I see that you're from Russia, where archive.is is blocked over HTTPS. However, the situation varies. China, for example, blocks HTTP but not HTTPS. archive.is now uses HTTPS by default. There's no policy on how to link to archive.is. And seeing how most of the world can access it through HTTPS, linking to it in the encrypted form is better than exposing traffic over an insecure link. Users in countries where one form or the other is blocked should make the extra effort to modify the link (HTTP/HTTPS) when trying to access it. —Hexafluoride Ping me if you need help, or post on my talk 13:33, 14 July 2016 (UTC)[reply]

Also note that Wikipedia itself is only available in HTTPS, so the point is pretty much moot (as in irrelevant). nyuszika7h (talk) 13:49, 14 July 2016 (UTC)[reply]
Hexafluoride, also note that pinging IPs does not work, try leaving a talkback on their talk page instead if you want to get their attention. nyuszika7h (talk) 13:50, 14 July 2016 (UTC)[reply]
@Nyuszika7H: I've mistakenly put this in the archive.is talk, instead of Wikipedia:Using archive.is. I'm moving it there. I've also Tb'ed the IP user. —Hexafluoride Ping me if you need help, or post on my talk 14:00, 14 July 2016 (UTC)[reply]

There was an RfC that decided to use https for archive.org .. I don't see why an RfC on archive.is would come to a different result. As for the argument that certain countries block HTTPS, would that be solvable with Wikipedia:Protocol-relative URL? — Preceding unsigned comment added by Green Cardamom (talkcontribs) 15:13, 14 July 2016 (UTC)[reply]

@Green Cardamom: As I said, Wikipedia itself is only available over HTTPS, so it's pointless. nyuszika7h (talk) 15:17, 14 July 2016 (UTC)[reply]
You have misinterpreted the point. The countries do not block HTTPS as a whole; they block particular domains. https://archive.is is blocked in Russia, but https://archive.li is not. As for China, the other way of blocking there (https is available and http is blocked) is not relevant here, because Wikipedia is blocked in China, so it is pointless to optimize Wikipedia pages for China. https://archive.is is the only protocol-domain combination (among 10: {http:archive.is, http:archive.today, http:archive.li, http:archive.ec, http:archive.fo, https:archive.is, https:archive.today, https:archive.li, https:archive.ec, https:archive.fo}) with geo-problems. Clicking such links from Russia (where Wikipedia is not yet blocked) leads to a network error, not even to a message saying "the page is blocked", so the https://archive.is links look like dead links. If you insist on using https, consider linking to another domain (archive.li, .today, .fo or .ec - they all work well with https). 93.185.28.66 (talk) 15:29, 14 July 2016 (UTC)[reply]
If they block archive.is, I don't think it's worth playing whack-a-mole with them, they will eventually block other domains if they get widespread use anyway. nyuszika7h (talk) 15:32, 14 July 2016 (UTC)[reply]
The problem is different (you can read about it on the Russian Wikipedia). "They" do not "want" to block archive.is. They need to censor a few pages which are illegal in Russia. With http - because it is not encrypted - only those pages are blocked and replaced with a message explaining why they are blocked. With https - because it is encrypted - it is not possible to block a particular page, only the whole domain. So, using another domain can be considered whack-a-mole, while using http is fair play with Big Brother: he sees your unencrypted traffic and tries to be as unobtrusive as possible. 93.185.28.66 (talk) 15:38, 14 July 2016 (UTC)[reply]
* BTW, https://archive.today redirects to http://archive.is in Russia and to https://archive.is outside, and serves no content, only a redirect, so it has the smallest chance of being blocked. It could be a good compromise as the default form of linking to archive.is from Wikipedia. 93.185.28.66 (talk) 16:00, 14 July 2016 (UTC)[reply]