Talk:Web archiving

From Wikipedia, the free encyclopedia
New category

There's no good topic category for Web archiving. This makes it hard to find some pages (e.g. file formats used for web archives) and has resulted in the set category Web archiving initiatives possibly being overused. I have created the category Web archiving as a child of Digital preservation and will leave it a few days to see if anyone objects before manually populating it with the other pages I think belong in it. Zosterae (talk) 15:30, 7 December 2015 (UTC)

Query: Archiving webpages produced by database queries.

It seems to be difficult presently to arrange for an archival copy of a webpage that is produced as a result of a database query. The issue comes up, for example, at the Internet Movie Database, when one makes a query for all films involving 2 collaborators; the page that is produced is not readily archived by on-demand services such as WebCite. This issue leads to 2 questions: does anyone know if there's a solution to the problem, or has anyone written about the problem so it can be noted in the present article? Easchiff (talk) 20:46, 18 January 2009 (UTC)

A page has to have a URL of its own in order to be archived. If it doesn't, you obviously can't post or cite the link per se, much less archive it. If it's a short page with not too much information, one solution is sometimes to copy and paste the information somewhere, perhaps in a subpage of an article's or user's Talk page, if you just want to preserve the information for temporary future reference. But the thing is, if a webpage doesn't have its own URL, then it likely isn't anything that would be used on Wikipedia as a citation or External Link anyway. Softlavender (talk) 09:48, 18 July 2009 (UTC)
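
To illustrate the point about a page needing its own URL, here is a minimal sketch, assuming the query is sent GET-style so that its parameters can be carried in the URL. The base URL `https://example.org/search` and the parameter names `name1` and `name2` are hypothetical placeholders, not the real query interface of IMDb or any other site, and the helper names are made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl


def query_page_url(base_url, params):
    """Build a URL that fully describes a query results page."""
    # Sort the parameters so the same query always maps to the same URL.
    return base_url + "?" + urlencode(sorted(params.items()))


def query_from_url(url):
    """Recover the query parameters from the URL alone."""
    return dict(parse_qsl(urlsplit(url).query))


# Hypothetical two-collaborator query; the site and field names are assumptions.
url = query_page_url(
    "https://example.org/search",
    {"name2": "Person B", "name1": "Person A"},
)
print(url)
print(query_from_url(url))
```

If the query can be rebuilt from the URL alone, as here, the results page has an address that can be cited or submitted to an on-demand archive. A results page that depends on form data or session state which never appears in the URL has no such address.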

Archive blocking

Blocking the archival of TOS and privacy policies seems notable to me. Any thoughts on whether the reasons in the edit summary of this edit make it meritorious? It's mine and was just undone. --Elvey (talk) 21:24, 28 June 2010 (UTC)

No answer; will attempt a compromise edit. (Is this another (less interesting) example of archive blocking: vs ? Forbidden (403) is not the same as Page Not Found (404), but I suppose this could be a WebCite bug. works; I suspect this is irrelevant, but only WebCite staff have the access necessary to really answer this one.) --Elvey (talk) 18:20, 3 July 2010 (UTC)

It seems more folks are preventing/blocking the archival of TOS and privacy policies. I just tried to archive the Merrill Lynch Brokerage Website Terms and Conditions as of June 18, 2010 (the date they were last changed), and not only was I unsuccessful, it triggered the locking of my account! I spent over 40 minutes on the phone getting the account unlocked, and they also helped me navigate to a PDF of the terms and conditions, but they block archiving of it; here's the archive attempt, which also shows the full URL: . They'll look into it and get back to me; it'll be interesting to see if anything changes. It's not archived by Google. (I'm trying to add the URL to Google's index. I just successfully added it to the list of URLs Google intends to crawl... someday. I will be very surprised if Google archives it, as that requires ML to treat Google differently from WebCite, and Google to get around to doing the crawling and to choose to index and archive the PDF.) Merrill is happy to serve, and WebCite is happy to archive, other PDFs, e.g.: The website's search feature only finds the T&C if the search is done by a logged-in user; the result is hidden when the search is done otherwise. --Elvey (talk) 22:48, 20 September 2010 (UTC)

Web archiving#On Demand

Aside from marketing jargon, the commercial services are functionally identical. Some are on-demand, some offer scheduled backup services. IMHO they should all be listed with identical common terminology. --Lexein (talk) 19:33, 7 July 2010 (UTC)

Blacklisted?

Why is blacklisted? Is it merely because it gets cited often in references?

I looked at the blacklist page, but it's pretty confusing:

As I understand it, all is, is a web archive. Sempi (talk) 04:58, 1 November 2011 (UTC)

WARC Tools

Following the link in the article, it appears that WARC Tools has been bought out by Symantec? I don't see any source code or downloads listed anymore, except for .pdf Sempi (talk) 05:45, 1 November 2011 (UTC)

Search Tool Forbidden Site

In this section, Search Tool of Google Code is listed, but it cannot be accessed. --Tito Dutta (Send me a message) 22:57, 29 January 2012 (UTC)

Big list of enterprise and subscription services

Do we need the Web_archiving#Enterprise_and_subscription_services section?

It is a big list of enterprise services that are very expensive (for example, a PageFreezer subscription costs $50.000/year) and have no version open for public use.

Perma.CC and Wikipedia

One question - requires that a link be used in a published journal and verified before being stored permanently. Will it "verify" links being cited in Wikipedia articles if submitted? We don't want to lose the content of one of the best open source "journals" in the world. Mdawn (talk) 16:55, 29 September 2013 (UTC)

Transactional Archiving of Remote Sites

Why is this stated to be impossible? All you need is a remote-controlled browser that cannot bypass the intercepting proxy. See (talk) 11:18, 26 January 2014 (UTC)

finds pages that have the same content

Cool example:[dead link] (who's Nathan?)--Elvey(tc) 21:59, 15 November 2014 (UTC)