Talk:Archive.today: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 16:37, 12 May 2016

The article contains false information: "archive.today removes archived pages in response to DMCA takedown requests from copyright holders." As a webmaster, I've had my site scraped against my will and sent a properly formatted DMCA notice to both the site and its ISP. It is a scraping site, masquerading as an archiving service. 80.62.117.71 (talk) 13:36, 26 June 2014 (UTC)[reply]

See also

 Ark25  (talk) 19:06, 26 July 2013 (UTC)[reply]

How does automatic archiving work?

Discussion of features/bugs, not the article

If you look at ro:Biserica de lemn din Hilișeu-Crișan, there is a dead link:

Archive.is knows this link: http://archive.is/http://www.ziarullumina.ro/articole;1418;1;3759;0;Schit-de-maici-cu-o-biserica-unicat.html Strangely, it has only a "newest shot" (6 Jul 2013 03:25), which is an error page. That makes me wonder whether it ever had older "shots" and whether it perhaps deletes the older ones. That would be bad.

The link has been there since 29 November 2010, so I guess it was archived on Archive.is before 6 Jul 2013. (Almost all external links of Wikipedia, in all Wikipedias and not only the English one, were archived in May 2012, says the Archive.is owner here: Wikipedia talk:Link rot#Archive.is)

It's a very good idea to automatically archive all the external links of Wikipedia, but it's very bad to then delete the snapshots and replace them with newer shots, which will eventually end up showing a "dead link".

It very much looks like Archive.is keeps only the newest shots when it archives pages automatically. —  Ark25  (talk) 23:48, 26 July 2013 (UTC)[reply]

Hi. The page was not archived before 6 Jul 2013. It has the short URL http://archive.is/HSzOE. This means it has the sequential ID 47136582 and can also be accessed as http://archive.is/id/47136582 (this is not a public URL but something like a debugging tool; do not use it for linking). If you have a look at the snapshots with IDs around 47136582 (for example http://archive.is/id/47136581 or http://archive.is/id/47136583), you will see that all of them were made on 6 Jul 2013.
Some snapshots are re-archived and overwritten. These are snapshots from URLs like http://www.google.com/sorry/indexredirect?continue=http://another.url/. Re-archiving helps when the server responds with a 500 error or a CAPTCHA.
Realtime tracking of the recent changes in all national Wikipedias is a relatively new feature (it has been fully operational since May-June 2013), so it is no wonder that some links which had been in Wikipedia for years were archived for the first time only in 2013. Rotlink (talk) 03:16, 18 August 2013 (UTC)[reply]

Useful feature

Archive.is can archive pages in the Google search cache. Once the content is archived, archive.is attributes it to the original website URL and not to Google's cache URL. This feature is useful when a site goes offline, that fact is noticed within a few days, the page isn't already archived in the Internet Archive, WebCite or elsewhere, and the only remaining copy of the page appears to be in the Google search cache. - 81.157.199.46 (talk) 20:50, 29 July 2013 (UTC)[reply]

If this is a fact intended to go into the article, then an independent reliable source, or at the very least primary, published online documentation will be required to support it in an inline citation. This will necessitate more hard HTML documentation or help pages at archive.is. Statements by non-published (non-notable) or anonymous authors in blogs/wikis/forums cannot meet WP:RS requirements. --Lexein (talk) 07:47, 3 October 2013 (UTC)[reply]

wiki.dandascalescu.com

In the comments made in the AfD discussion I don't see a consensus for removing this citation. —rybec 14:51, 21 September 2013 (UTC)[reply]

It's a wiki; that's a gnarly WP:RS problem, even if Dan Dascalescu is an established or published expert in the field or academia, cited by others. The blog might be assessed as RS if we can establish Dan's bona fides.--Lexein (talk) 07:51, 3 October 2013 (UTC)[reply]

Robot Exclusion Standard

The article seems to be getting mixed up regarding the Robot Exclusion Standard, and the fact that Archive.is does not honor the standard, and what this means. The purpose of my recent edits was to clarify that this standard is used by the main archives (like WayBack and WebCite) to avoid infringing on copyrights, whereas Archive.is does not honor this standard, so there is a large amount of material re-hosted on Archive.is that is in violation of copyright law, specifically, the Digital Millennium Copyright Act (DMCA).

Some other editors deleted the link I provided to the Robot Exclusion Standard (saying it is a "dead link", although I have no trouble accessing it), and then inserted the statement: "... however, the protocol is used against malware robots in general, which routinely scan the web for security vulnerabilities and email-address harvesters used by spammers. Archive.is does not obey the robot exclusion standard designed against spammers." I frankly don't understand these words. The Robot Exclusion Standard doesn't provide any protection against malware robots, nor against spammers. It is a voluntary standard that is used by responsible organizations to work together to avoid unintended interactions, among which are copyright violations (which of course are NOT discretionary).

So, I propose to trim the words about malware and spam, and just go back to the relevant and well-sourced statements about how archives use robot exclusion to avoid copyright infringement, and the well-sourced and undisputed fact that Archive.is does not honor this standard. I'll also add the requested citation for the DMCA. Weakestletter (talk) 21:57, 22 September 2013 (UTC)[reply]

By the way, as I was editing the article, I noticed that the words about malware and spam actually make no sense at all, because they say "the protocol is used against malware", and yet the cited reference says just the opposite: "...malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention." These words signify that the protocol is NOT useful against malware or spammers. So those edits to the article are clearly not right. I've deleted them. Weakestletter (talk) 22:04, 22 September 2013 (UTC)[reply]
Yeah, that looks like a basic misunderstanding of what robots.txt actually does -- i.e. nothing by itself -- it is the robots that choose to act or not act on it. —  HELLKNOWZ  ▎TALK 22:58, 22 September 2013 (UTC)[reply]
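To illustrate the point that robots.txt does nothing by itself, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the bot names (`ExampleArchiveBot`, `OtherBot`) and the example.com URLs are hypothetical and only for illustration; nothing here describes how any particular archive actually implements its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish.
# "ExampleArchiveBot" is an invented crawler name, not a real one.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    The check is purely advisory: it only has an effect if the
    crawler's own code calls it and respects the answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A cooperating crawler asks first and skips disallowed pages.
print(is_allowed("ExampleArchiveBot", "http://example.com/page.html"))
print(is_allowed("OtherBot", "http://example.com/page.html"))
print(is_allowed("OtherBot", "http://example.com/private/x.html"))
```

A crawler that never calls such a check can fetch every page regardless of what the file says, which is why the file offers no protection against malware robots or spammers.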
The article seems to imply that pages are retrieved when someone requests that they be archived, and that archive.is doesn't retrieve pages en masse but singly when someone requests a page. If it's not crawling sites, it's debatable whether it's a bot.
If someone were to spider and republish the contents of a Web site, the mere absence of a robots.txt would make a poor defence against claims of copyright infringement. The Wikipedia article doesn't mention the word "copyright"--the protocol is not a way to give or withhold permission to republish.
The DMCA is a US law. The .is top-level domain suggests that archive.is may be in Iceland, which is not (yet) part of the United States. The DMCA makes exceptions for libraries and archives, so if archive.is does happen to be under US jurisdiction, it might be able to claim it is an archive, and republish without violating copyright. —rybec 01:15, 23 September 2013 (UTC)[reply]
It's certainly true that there's a difference between not honoring robot exclusion files and not honoring copyright. In theory, an archive service could try to contact each individual website owner and request permission to re-host their material. Strictly speaking, the copyright laws probably require this... but that would make web archiving almost impossible in practice. So, working in good faith, the responsible archivers (e.g., WayBack, WebCite) honor robot exclusion files. They advertise that if any copyright holder doesn't want their material archived, just place a robot exclusion file in the root directory, and they will respect your copyright and not re-host your material. If you don't, then the archives will take the absence of a robot file for tacit permission to archive your site. This isn't a perfect arrangement, and it's rather heavily biased in favor of the archives, but it's workable.
Now, an archiving service like Archive.is comes along, and says they do not honor robot exclusion files, and they will re-host any material they want to, even if there is a robot file saying the author does not give permission. That's a clear violation of copyright, and moreover, it is clear that Archive.is is not even attempting to honor copyright.
By the way, the DMCA does NOT allow archives to re-host copyrighted material. (Brick and mortar libraries obviously are allowed to loan out purchased paper copies of copyrighted material, but that is completely different, and even they are under strict prohibition against creating any new copies, beyond the physical copies they purchased.) In fact, the whole purpose of the DMCA take-down agreement is so that large sites can avoid prosecution for copyright violation IF they agree to promptly take down and de-link (and even remove from search results) any site that is re-hosting copyrighted material without permission.
One more comment - The fact that Archive.is is hosted outside the United States is not particularly relevant, because they are lobbying to be used as citations for Wikipedia articles (for example), and Wikipedia does business in the US, and strives to avoid violating US copyright laws. So the whole ostensible purpose of Archive.is is undermined by its failure (so far) to implement some effective means of honoring copyright law. Until it does so, I don't think it can be adopted by any reputable web site. Weakestletter (talk) 02:32, 24 September 2013 (UTC)[reply]
I see that the DMCA exemption for libraries and archives (look for section 404 at [1]) only allows the copied material to be used on the premises, not over the Internet:

any such copy or phonorecord that is reproduced in digital format is not otherwise distributed in that format and is not made available to the public in that format outside the premises of the library or archives.

Is there a reliable source that says archive.is is "lobbying to be used as citations for Wikipedia articles"? I took out the sentence about Wikipedia because it seemed to belong on a Wikipedia project page, not in a regular article. —rybec 03:44, 24 September 2013 (UTC)[reply]
That is its entire function. See the Wikipedia project page, where this is being promoted:
http://en.wikipedia.org/wiki/Wikipedia:Using_Archive.is
In particular, note that "Archive.is monitors RecentChanges of many wiki projects (including all national wikipedias) in order to authomaticaly archive new links as soon as possible after the editors added them to the articles." You see? Archive.is is designed specifically to re-host Wikipedia links, for the purpose of having stable references. Its whole reason for existence is to be used to link from Wikipedia articles, rather than linking to the original web sites which, of course, are under the control of those unreliable copyright holders. What the founders of Archive.is seem to have overlooked is that Wikipedia needs to scrupulously adhere to the copyright laws of the countries where it operates, including the US, and this prohibits it from linking to unauthorized re-hosted copies of copyrighted material. Weakestletter (talk) 14:36, 24 September 2013 (UTC)[reply]
You are right, it should say "User:RotlinkBot monitors ...", not "Archive.is monitors ...". This is known from User:RotlinkBot's comments on Wikipedia talk pages, not from the Archive.is FAQ or Twitter. 88.15.83.61 (talk) 16:13, 24 September 2013 (UTC)[reply]
It is also not clear if User:RotlinkBot is still doing it after he/she/it was banned on Wikipedia. 88.15.83.61 (talk) 16:31, 24 September 2013 (UTC)[reply]

filter misbehaviour

The filter which prevents adding links to the pages on the archive.is website also prevents adding links to the Archive.is article, i.e. links such as [[Archive.is]]

Country blocking

Information about the country blocking is self-evidently available on the web.

You can easily google for currently active proxies in the countries in question and then run something like "Chrome.exe --proxy-server=socks5://37.27.205.217:35101 http://archive.is"
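For checking more than one country, the command line above can be generated per proxy. The following is only a sketch: the helper name `chrome_proxy_command` is made up for illustration, the proxy address is the one quoted above and may no longer be active, and the function only builds the argument list without launching anything.

```python
from typing import List


def chrome_proxy_command(proxy: str, url: str = "http://archive.is") -> List[str]:
    """Build the Chrome argument list for opening url through proxy.

    proxy is a scheme://host:port string, as accepted by Chrome's
    --proxy-server switch. Nothing is executed here.
    """
    return ["Chrome.exe", f"--proxy-server={proxy}", url]


# Proxy address taken from the comment above; substitute currently
# active proxies for the countries in question.
for proxy in ["socks5://37.27.205.217:35101"]:
    print(" ".join(chrome_proxy_command(proxy)))
```

Each resulting list could be passed to `subprocess.run` on a machine where Chrome is on the path.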

Removal of archived content

The article needs to be updated, as there is a "report" button where it's possible to report archived content to be taken down for a wide variety of reasons. nyuszika7h (talk) 09:24, 16 April 2016 (UTC)[reply]

Hatnote

What's better here - "Not to be confused with Internet Archive." or some variant of "For the San Francisco-based nonprofit website at archive.org, see Internet Archive."? User:94.230.146.228 is concerned that by being specific we're implying that the two websites are connected, but I think it's more misleading to say "not to be confused with Internet Archive" because that can be easily read as "not to be confused with archiving on the internet in general" - a reader actually looking for archive.org (without knowing its URL) might not think to click that and assume that they're already at the right article. --McGeddon (talk) 10:02, 11 May 2016 (UTC)[reply]

Perhaps link to Archive.org in the hatnote, which is a redirect to Internet Archive? nyuszika7h (talk) 20:18, 11 May 2016 (UTC)[reply]
Saying "For the San Francisco-based nonprofit website at archive.org, see Internet Archive." has a false connotation of "archive.is is sort of archive.org but for-profit" or even "there is a single company with non-profit and for-profit products; the former is archive.org and the latter is archive.is". In any case, "...for non-profit see ..." implies that the following text is about something that is not non-profit, although archive.is is non-profit as well. 94.230.146.228 (talk) 16:37, 12 May 2016 (UTC)[reply]