
Wikipedia talk:List of web archives on Wikipedia


Downtime reports


Pings to some of the users who may be interested in maintaining this list and trying to think up ideas of what to do: @GreenC, Pawnkingthree, Thryduulf, and Alsee:. Boud (talk) 15:57, 1 October 2019 (UTC)[reply]

Please add updates within the per-site sections. (Unless there's a better suggestion.) Boud (talk) 15:57, 1 October 2019 (UTC)[reply]

Webcitation

  • 2019-10-01: WebCite deprecation RFC. The consensus seemed to me overwhelmingly in favour of deprecating WebCite usage on Wikipedia because of sporadically repeated downtime over many years, though I don't claim this as a full summary of the full discussion - read the RFC yourself. As of 2019-10-01, WebCite does not accept any new archiving requests, with the message "We are currently not accepting archiving requests." Boud (talk) 15:57, 1 October 2019 (UTC)[reply]
Converting these to a new archive is tricky because of 'content drift', i.e. a page has different content depending on when the snapshot was taken. This is particularly true of WebCite because for many years people used it precisely for this reason, to capture a page on a certain day/time (this was before archive.org had this ability), so WebCite links have a high rate of content drift. Thus transferring them to another provider requires that the provider has the same timestamp available, which most of the time is not the case. The alternatives are to archive the WebCite page itself (not ideal), or to transfer the WARCs from WebCite to another provider (best). I believe the latter might be possible, but for now I am unable to confirm it. Probably the best thing is to leave links as-is unless you can find a new archive without content drift (a manual process). -- GreenC 16:52, 1 October 2019 (UTC)
  • 2022-01-31: Attempts to access existing webcitation.org archives are reporting "DB Connection failed". I don't check regularly, so this may have been going on for several weeks. If it's working for you, please post so that I know it's a personal problem. Thanks! Fabrickator (talk) 05:03, 31 January 2022 (UTC)[reply]
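
The content-drift point above (a replacement archive is only safe if it has a snapshot at essentially the same time) could be sketched roughly as follows. This is only an illustration, not how any bot on Wikipedia actually works: the function name `closest_snapshot` and the one-day tolerance are assumptions, and the 14-digit timestamp format is the one visible in the archive URLs quoted on this page.

```python
from datetime import datetime, timedelta

# 14-digit timestamp (YYYYMMDDhhmmss), as seen in archive URLs on this page.
TS_FORMAT = "%Y%m%d%H%M%S"


def closest_snapshot(target, candidates, tolerance=timedelta(days=1)):
    """Return the candidate timestamp nearest to `target`.

    Returns None when no candidate lies within `tolerance`, in which case
    the original link is better left as-is (risk of content drift).
    The default tolerance is an arbitrary assumption for illustration.
    """
    want = datetime.strptime(target, TS_FORMAT)
    best = None
    best_gap = None
    for ts in candidates:
        gap = abs(datetime.strptime(ts, TS_FORMAT) - want)
        if gap <= tolerance and (best_gap is None or gap < best_gap):
            best, best_gap = ts, gap
    return best


# Hypothetical snapshot lists from another provider.
print(closest_snapshot("20100826150209", ["20100101000000", "20100826180000"]))
print(closest_snapshot("20100826150209", ["20120101000000"]))
```

A match within the tolerance would still need a manual check that the page content is the same.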

Wikiwix

  • 2019-09-30, 2019-10-01: Over about the last 24-48 hours, I have been able to connect to wikiwix.com, but unable to connect to archive.wikiwix.com -- Error message (firefox):
 The connection was reset
 The connection to the server was reset while the page was loading.
 The site could be temporarily unavailable or too busy. Try again in a few moments.
Boud (talk) 15:57, 1 October 2019 (UTC)[reply]
We (enwiki) have around 4,000. At some point I ran a process to migrate them to other archive providers when possible, for a number of reasons, and got most of them. Those remaining are orphans with no other option. If they are permanently dead, so be it. But give them some time. The frwiki has millions of them, so there might be an effort to restore them eventually; the data presumably still exists. -- GreenC 16:43, 1 October 2019 (UTC)
Works ok for me: http://archive.wikiwix.com/cache/20100826150209/http://articles.latimes.com/1992-12-22/business/fi-2407_1_teen-talk -- GreenC 21:40, 17 January 2020 (UTC)[reply]
Ah yes, it's been up since some time 17 January UTC - I meant to put a note and forgot. So maybe 30 hours or so of downtime: annoying, but not the end of the world. Boud (talk) 00:42, 18 January 2020 (UTC)[reply]
Hello, I would like to find out how the integration of several archivers into the database is progressing.
My request is a little old: https://fr.wikipedia.org/wiki/Discussion__utilisateur:Pmartin#Task_4:_IABOT_DB_expanding_for_others_archive_services_(Cyberpower678). We compressed the data well, so it fits on a single server. And I have at least 10 years of production ahead of me. So I'm starting to make a backup of the links in the IABot database. https://phabricator.wikimedia.org/T358978 Pmartin (talk) 18:15, 3 March 2024 (UTC)
@GreenC https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives Is it possible to add me to this list as well, even if I'm not part of the book? Pmartin (talk) 18:33, 3 March 2024 (UTC)[reply]

Wayback machine/web.archive.org

  • 2019-09-30, 2019-10-01: Over about the last 24-48 hours, accessing archived pages has worked correctly, but for periods of several hours, sporadically,
 https://web.archive.org/save/https://URL.to.be.backed.up 
gives a blank page in firefox. Boud (talk) 15:57, 1 October 2019 (UTC)[reply]
I'm reasonably convinced from experience that this is an anti-overload-per-user limit. So this is reasonable. Boud (talk) 01:56, 16 January 2020 (UTC)
Yes, this exists. When doing parallel searches or requests it will slow down, and depending on load possibly time out, requiring a retry. It depends on how busy the servers are; it is dynamic. -- GreenC 00:40, 26 February 2020 (UTC)
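
For anyone scripting requests, the "slow down, time out, retry" behaviour described above is usually handled on the client side by retrying with increasing delays. A minimal sketch, assuming the caller supplies their own `fetch` function that raises the hypothetical `Throttled` exception on a timeout or refusal; the attempt count and delays are arbitrary, not values published by any archive:

```python
import time


class Throttled(Exception):
    """Raised by the caller's fetch function on a timeout or refusal."""


def fetch_with_retry(fetch, attempts=5, first_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Re-raises Throttled if the final attempt also fails. `sleep` is
    injectable so the function can be exercised without real waiting.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Throttled:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

Running requests one at a time rather than in parallel should also reduce how often the limit is hit, going by the description above.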

Server overload/file management problem

  • 2020-02-19 Since a day or so ago, pages that are archived seem to become unavailable. As an example right now,
Hard to know what is happening. Load, indexing error, down server/disk, intentional removal or some other problem. -- GreenC 22:35, 19 February 2020 (UTC)[reply]
@GreenC: Leaving aside speculation on the hardware/software explanation of the problem (in at least one case I'm sure that the server was rewriting or redirecting URLs), do you have this problem too? At first I thought I had messed up in terms of copying URLs correctly. It would be nice to have confirmation that this is not just a problem for my IP/browser/geolocation. Boud (talk) 22:58, 19 February 2020 (UTC)[reply]
Yes, confirmed. I have seen this before when running WP:WAYBACKMEDIC, which is a bot that tests archive URLs to ensure they are still working. I don't specifically log "Hrms" (probably should), but when one is encountered the bot tries to find a new archive provider such as archive.today. There are a lot of error pages at Wayback; the "Hrm" is one type. It would be interesting to watch and see if it comes back alive at some point. -- GreenC 00:24, 20 February 2020 (UTC)
@Boud: The article now exists in the archive . Also, when I checked yesterday it didn't exist so IDK what happened. --MrLinkinPark333 (talk) 00:42, 21 February 2020 (UTC)[reply]
Thanks for the feedback - this helps our hand-edited logs. So the problem is intermittent. Interest in using Wayback is unlikely to be dropping - there must be a huge pressure on the servers (and software, given the complexity of web pages) - so there are all sorts of reasons why it might have some intermittent problems. Boud (talk) 00:52, 21 February 2020 (UTC)[reply]

Archive.today

  • 2020-02-25 Since a few minutes ago, downloading archive.ph (archive.today seems to have a redirect to archive.ph) URLs I get a refusal to show me the pages if I leave Google disabled in uMatrix. Since Wikipedia has no reason to support totalitarianism - it's the opposite of what Wikipedia is all about - this apparent change by archive.today would appear to disqualify it from regular usage on Wikipedia. I haven't tried enabling GAFAM; there is google.com, henley-putnam.edu, and even the evilest part of GAFAM, google-analytics.com, which are now all called by archive.today. So I have no idea if downloading the pages is accepted if the user chooses to allow one or some of these third party sites. Boud (talk) 23:31, 25 February 2020 (UTC)[reply]
Given the option of having a dead link vs archive.today, most users would prefer archive.today; some kind of warning in the documentation could be appropriate. How important this is will vary, and many people won't care at all. Most of us use Google etc. for free with the understanding (or not) that they are collecting data about us, so if archive.today is part of the borg that is unfortunate, but not a reason to disqualify it; otherwise many important sites would be blacklisted. This is reason number 99 for Wikimedia to run an archiving service. -- GreenC 00:49, 26 February 2020 (UTC)
Here, "disqualify" counts as my WP:OR ;) - I realise that, for the reasoning that you state, my "disqualify" judgment doesn't stand a snowball's chance in hell of gaining consensus. Wikiwix is a heroic (AFAIK) effort in the direction of what you propose, but I suspect that a much bigger dedicated geek team + server resources + bandwidth would be needed; something to be proposed to the WMF Board. In terms of ability to handle various web pages' various javascript functionalities, archive.today seems to have fewer rejections ("cannot save that page!") than Wayback, and Wayback seems to have fewer rejections than Wikiwix (e.g. Washington Post, New York Times, Le Monde). Boud (talk) 20:29, 26 February 2020 (UTC)[reply]
That problem only lasted for a few days - it switched back to using non-google services, at least for me. Boud (talk) 16:32, 15 March 2020 (UTC)[reply]
  • 2020-03-15 Maybe as a consequence of the 2019–20 coronavirus pandemic isolating people around much of the world at home and increasing internet usage in general, and maybe even from pupils and students who check references (am I being hopelessly optimistic/naive? :)), archive.today has been heavily overloaded. What is very effective is its choice to notify the user about what position his/her request is in the queue. Right now I have a request at about the 4400-th in the queue, for the sort of URL that Wayback tends not to handle so well. Boud (talk) 16:32, 15 March 2020 (UTC)[reply]
Another option for one-off saves of links that Wayback doesn't do well is webrecorder.io, which is probably the best saving technology of all (with a few exceptions). It is limited though by disk space for each account, and as for its long-term existence, who knows; it seems to be dependent on a few grants not yet tested by an economic downturn. I use it for embedded videos and scrolling/slideshow content. -- GreenC 18:07, 15 March 2020 (UTC)
  • 2024-05-25, 2024-05-26 Since yesterday or so, most of the archive.today domain names, including the TOR site - http://archiveiya74codqgiixo33q62qlrqtkgmcitqx5u2oeqnmn5bpcbiyd.onion - have given a Welcome to nginx! announcement. I got the usual interface for saving or searching a few times, but after clicking eventually got the Welcome to nginx! announcement again. As stated at our Wikipedia article archive.today, this seems to be very much a random person in Nebraska problem, though even worse, because all we know is that the maintainer is someone highly altruistic, with good geeky skills and good internet access, somewhere in the world. Could have had a car accident, fallen in love, been eliminated with Novichok, had a severe disk failure, difficulties in upgrading the OS, or just has autistic burnout. Hope s/he recovers personally and gets the system running again... Boud (talk) 15:13, 26 May 2024 (UTC)[reply]
    The site is up. I'm using it constantly during that period and have not experienced downtime. I have seen the "Welcome to nginx" before but this was due to my DNS resolvers having bad data. I was able to solve it by switching to a new DNS resolver and rebooting my router to clear the old data. - GreenC 16:53, 26 May 2024 (UTC)[reply]
    I guess it probably must have been a temporary issue with DNS resolvers in some places in that case. Glad it was nothing more serious than that. Access looks OK to me now using the same systems as before. Boud (talk) 20:56, 28 May 2024 (UTC)[reply]
    I'm not convinced that I have a DNS resolver problem. Checking this from the command line with wget https://archive.today/?run=1&url=${URL} gives me five tries, with a roughly two-minute delay between each, ending with HTTP request sent, awaiting response... 429 Too Many Requests. The specific IPs vary on independent tests, which is consistent with round-robin DNS resolution taking place correctly. So it seems like an overload problem: either deliberate DOS or DDOS or just the problem of popularity. The person running the service apparently does not have mega server centres like WMF or GAFAM. Boud (talk) 20:37, 29 May 2024 (UTC)[reply]
    Maybe a rate limiting policy that thinks you're a bot. 429 might also be the response for a captcha. -- GreenC 21:41, 29 May 2024 (UTC)
    Restricting access to requests that don't look like browsers seems like a reasonable restriction against bots, especially if they are requesting a page to be archived.
    Since you seem to have access, could you please check if there are announcements of any alternative TOR URLs than http://archiveiya74codqgiixo33q62qlrqtkgmcitqx5u2oeqnmn5bpcbiyd.onion ? At archiveiya74codq..biyd I mostly only get Welcome to nginx!, although clearing cookies got me through to some normal 'archive.today' pages, but the step to save a new page got back to Welcome to nginx! again.
    The problem for me is not just new requests, but reading already archived pages. If a fair fraction of people have the same access problem reading already archived pages, then the utility of the archive is greatly weakened. Most people checking Wikipedia sources won't think to themselves, "Good that the page is available in principle, I'll come back in a week or so to check later" or "Good to know that someone else can check the archived page". Boud (talk) 10:22, 30 May 2024 (UTC)[reply]

Archive.today preferred URL


@GreenC: Based on https://phabricator.wikimedia.org/T245276, I assume that it's better to write archive.today URLs using literally archive.today rather than the redirects such as archive.ph or archive.vn. Is this right?

I'll let the bot fix existing usage, but I'm happy to manually write archive.today given the arguments in favour presented on that discussion page. Boud (talk) 18:43, 17 March 2020 (UTC)[reply]

Yes, archive.today is what they (archive.today) prefer Wikipedia to use generally. The Phab ticket is for an edge case bug in IABot that is currently unfixed. We needed to convert about 700 links that met the special criteria (short form and using ph, md or vn) because IABot was seeing those and converting them to Wayback; making them archive.today jigged around the bug. -- GreenC 20:29, 17 March 2020 (UTC)
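
For anyone doing this by script rather than by hand, rewriting the alias domains to archive.today could look something like the sketch below. This is not the code used by IABot or any other bot; the function name is made up, the alias set contains only the three domains mentioned in this thread (ph, md, vn), and the short ID in the example is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Only the alias domains mentioned in this thread; others may exist.
ALIAS_HOSTS = {"archive.ph", "archive.md", "archive.vn"}


def prefer_archive_today(url):
    """Rewrite an archive.ph/.md/.vn URL to use the archive.today host.

    URLs on any other host are returned unchanged. Only the host is
    inspected, so an alias domain appearing inside the path (for example
    within a Wayback URL) is left alone.
    """
    parts = urlsplit(url)
    if parts.hostname in ALIAS_HOSTS:
        parts = parts._replace(netloc="archive.today")
    return urlunsplit(parts)


print(prefer_archive_today("https://archive.ph/abcde"))  # hypothetical short ID
```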

What does "No memento access" refer to?


It is not clear what "no memento access" refers to. Does it mean the archives are completely inaccessible? Elominius (talk) 00:41, 1 October 2023 (UTC)[reply]

Memento is a distributed protocol that informs which archives are available at an archive provider, sort of like an API. [1]. There are two ways to access it, via a central database hosted at http://timetravel.mementoweb.org or DIY in which case your bot would query each archive provider individually. The DIY is more accurate since the local provider will have the most up to date information about their service. The central method is faster and easier, but the data may not be up to date. -- GreenC 04:45, 1 October 2023 (UTC)[reply]
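
To make the DIY route a bit more concrete: under the Memento specification (RFC 7089), a client asks a provider's TimeGate for the snapshot nearest a given time by sending an `Accept-Datetime` header. The sketch below only builds the request and sends nothing. The function name is made up, the TimeGate prefix in the example is a placeholder, and it assumes a TimeGate that takes the original URL appended to its prefix; the real prefix for each provider has to be looked up.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_prefix, original_url, when):
    """Return (url, headers) for a Memento datetime-negotiation request.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    formatted as an HTTP date for the Accept-Datetime header.
    Assumes the TimeGate URL is the prefix followed by the original URL.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return timegate_prefix + original_url, headers


# Placeholder TimeGate prefix; a bot would loop over one prefix per provider.
url, headers = timegate_request(
    "https://timegate.example/",
    "http://example.com/page",
    datetime(2007, 5, 31, 20, 35, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

A bot taking the DIY route would send one such request per archive provider and compare the answers, whereas the central method needs a single query to the aggregator.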

Proposal to add a template


I would like to create a template containing the parameters listed for most of the archiving services. I would also like to add the parameter "organization" for the organization sponsoring the service. Any objections to this? --Bensin (talk) 18:35, 21 December 2023 (UTC)[reply]

What's the purpose of the template? We already have {{webarchive}} which can accept up to 10 archive sources. -- GreenC 18:42, 21 December 2023 (UTC)[reply]
Not for that. I propose a template (something like "web archive service") taking parameters in each entry in the list of this list article: "Article", "Domain", "Launched" etc. --Bensin (talk) 20:27, 21 December 2023 (UTC)[reply]