
Template talk:Webarchive

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 109.78.242.41 (talk) at 13:54, 22 July 2018 (Preview error please?). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Quick reference

WP:WEBARCHIVES gives details of (all?) available web archiving services. This may be helpful when discussing this template and its Lua module.

Yeah, I made that page and this template; the template is largely up to date in displaying those archives. But thanks for linking to that page, it is relevant. -- GreenC 15:48, 2 December 2017 (UTC)[reply]

One template

Hi folks, {{webarchive}} is an old redirect to {{wayback}} no longer in use. I'm thinking of starting a new project in Lua to repurpose this template to aggregate {{wayback}}, {{webcite}}, {{memento}}, {{cite archives}}, (any others) into a single generic web archive template. It would also add new features listed below. This came out of the discussion at Wikipedia:Templates_for_discussion/Log/2016_October_7#Template:Cite_additional_archived_pages.

A number of problems were identified with the current methods:

  • Lack of support for multiple archive links such as exists in {{cite archives}}. This is true in CS1|2 templates like {{cite web}}; and in templates like {{wayback}}
  • Lack of support for archives other than wayback and webcite. There are dozens of web archive services but we only have a few templates available and it would be impractical to make a new template for every service.
  • Confusion of methods and documentation across multiple web archival templates.

The solution:

1. Create a new template {{webarchive}} in Lua. It will take parameters like this:
|url=https://web.archive.org/web/201609010000/http://example.com
|date=1 September 2016
  • It would have standard options like title and nolink; and new support for multiple URLs eg. |url2=
  • It would detect the archive type by domain name, and make sure the rendering mirrors that of any existing template as closely as possible. So it would be possible to replace {{wayback}} with {{webarchive}} and the output would look the same and not break pages.
  • It will have tracking categories for errors which most templates currently lack.
  • It will have CS1-style red inline error messages to alert editors to problems.
  • A single documentation page, instead of different docs and methods for each template.
2. Create a bot to change all instances of old templates to the new template.
3. Retire/delete the old templates.
4. Start discussion with CS1|2 to see if they would be willing to add support for multiple archive links such as |archiveurl2=.
5. Check with IABot that it will be able to use this new template.

I've already converted {{wayback}} to Lua so have some experience. It will take some work to develop and test, but I don't foresee any technical roadblocks, just time.

Linking here from the talk pages of relevant templates. Also @Evad37 and Cyberpower678:. -- GreenC 15:50, 9 October 2016 (UTC)[reply]
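The domain-based detection proposed in step 1 could be sketched roughly as follows. This is only an illustration in Python, not the Lua module itself; the `SERVICES` table and the `detect_service` name are hypothetical, and the real module's list of services is not shown in this discussion.

```python
from urllib.parse import urlsplit

# Hypothetical host-to-name table for illustration only; the actual
# module would carry its own, longer list.
SERVICES = {
    "web.archive.org": "Wayback Machine",
    "webcitation.org": "WebCite",
    "archive.wikiwix.com": "Wikiwix",
}


def detect_service(archive_url: str) -> str:
    """Return a display name for the archive service hosting archive_url."""
    host = (urlsplit(archive_url).hostname or "").lower()
    for domain, name in SERVICES.items():
        # Match the domain itself or any subdomain such as "www.".
        if host == domain or host.endswith("." + domain):
            return name
    return "unknown"
```

A caller would pass the value of |url= and use the returned name to choose the matching rendering, falling back to a generic one for unknown hosts.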

IABot would need to be reconfigured to support this change.—cyberpowerChat:Limited Access 19:48, 9 October 2016 (UTC)[reply]
  • Question: Is the template output going to comply with Citation Style 1? Keep in mind that CS1 often seems to be wrongly conflated with the CS1 templates, which are applications of the style and not the style itself. As a result, the style elements of CS1 are poorly defined, but general guidelines can be inferred. These concern the order of displayable values, their visible interdependencies, separators, text formatting, terminal punctuation and static (pre-inserted) text. I am bringing this up because the output of templates like {{wayback}}, which the proposed module will replace, does not comply with CS1. 65.88.88.127 (talk) 17:47, 18 October 2016 (UTC)[reply]
The output of {{webarchive}} mirrors {{wayback}} and {{webcite}}. These templates were never designed to be CS1 because CS1 templates have their own support for |archiveurl=. There are also 85,000+ instances of the templates, so changing the output could break many pages. If we wanted a new style of output, as an optional argument switch, that could be done (or make the legacy style the option and the new style the default). {{cite archives}} was a sort of workaround for CS1's lack of support for multiple archives, so the goal there is to get support for that in CS1 and retire {{cite archives}}. In the meantime, {{webarchive}} can support it with an extra option |format=. -- GreenC 21:55, 18 October 2016 (UTC)[reply]
The question wasn't about compatibility with CS1 templates, it was about compatibility with CS1 as a style. The new template can be compatible with the style independent of CS1 templates, CS1 modules etc. The question is, will it? 65.88.88.126 (talk) 22:54, 18 October 2016 (UTC)[reply]
I think the intention is:
  • "no" by default (so as to reproduce existing templates' behaviour, except for {{cite archives}})
  • "yes" with the |format= parameter, if placed at the end of a CS1-style citation (to reproduce {{cite archives}}'s CS1-style behaviour)
  • that if/when CS1|2 module/template support is available, the {{cite archives}} behaviour would be deprecated and eventually removed, as it would be redundant to specifying the additional archives in CS1|2 templates directly - Evad37 [talk] 02:45, 19 October 2016 (UTC)[reply]
Exactly. -- GreenC 03:03, 19 October 2016 (UTC)[reply]
Thank you. 65.88.88.126 (talk) 12:27, 19 October 2016 (UTC)[reply]
  • Comment: I feel that the url parameter should be the original canonical URL. The *-date parameters could instead specify which archive service is being used, i.e. wayback-date, webcite-date, etc. The inclusion of multiple URLs (i.e. url2 etc.) should alert editors that the page was moved. How would you do this? Regards. – Allen4names (contributions) 18:30, 19 October 2016 (UTC)[reply]
What you are proposing seems like a different template altogether. The design goals for this template are set out above, and they may not be trivial. Let's not unnecessarily complicate things. 72.43.99.130 (talk) 19:42, 19 October 2016 (UTC)[reply]

Proposed changes: |format= and |via=.

Is it possible for |format= to acquire a more general role? The way I understand it, it now just signifies CS1 compliance. This may be a waste of a parameter. I recommend that "format" take any of the following values: wayback|webcite|cs1|cs2|memento, and then display the results accordingly. If that is too much work, then maybe start with the most heavily used styles? Additionally, I would like to ask that |via= be included, with a function/style similar to the one |via= has at CS1. Personally I think that the current nomenclature ("at Wayback" etc.) should not apply to items that are retrieved. Neither is the link at the related repository. An available copy is there; the link lives in the template code, and the retrieved item is likely on the user device cache. However, it is retrieved via the repository. 72.43.99.130 (talk) 20:00, 19 October 2016 (UTC)[reply]

Initial, rudimentary, observations.

As expected, today's version results in faster page loads compared with {{cite archives}} (real-world examples from Order of the Star in the East). Also as expected, it results in heavier resource use than {{cite archives}} (larger argument size, more visited nodes, etc.). So far, so good. 65.88.88.75 (talk) 20:52, 20 October 2016 (UTC)[reply]

Archive wikiwix

Hello,

I am Pascal Martin from Linterweb. Since 2008, our Wikiwix service has archived more than 100 million French- and English-language source links from Wikipedia. Our system detects links on Wikipedia in real time and backs up the content of the external links, while respecting the noarchive tag.

I am therefore writing to offer our archives as a source for the Webarchive template, simply via http://archive.wikiwix.com/cache/?url=http://www.letelegramme.fr/ig/generales/regions/cotesarmor/coat-an-noz-le-chateau-retrouvera-son-eclat-01-08-2011-1386817.php

Please note that I am the manager of a small company; my goal is not to make money from the archives but to offer an alternative for safeguarding content and to provide some big data for European research.

Indeed, starting in December we will deploy our technology across the entire corpus of the Wikimedia Foundation, in all languages.

We are hosted by the French University Network.

Sincerely, Pascal Pmartin (talk) 18:41, 4 November 2016 (UTC)[reply]

Pmartin, thanks for getting in touch. I looked at Wikiwix a week ago and had some questions. English Wikipedia has tens of millions of external links and we use bots to automate most of the archiving.

  • I could not find an API or method to determine if an archive is available. For example this http://archive.wikiwix.com/cache/?url=http://www.nowork.zzz returns a status code of 200 even though the page is not available. It should return 404 in this case, or whatever the status code of the original page was. Otherwise bots will not be able to verify if the page is available and working.
header "http://archive.wikiwix.com/cache/?url=http://www.nowork.zzz"
  HTTP/1.1 200 OK
  Date: Fri, 04 Nov 2016 19:19:08 GMT
  Server: Apache/2.2.22 (Debian)
  • There is no date. Is there a way to know when the page was archived? This is important as we keep track of archive dates, since pages change over time.
  • Will the link disappear? I recall reading that links on wikiwix.com are deleted if the link on Wikipedia is deleted. It's unclear for which language edition of Wikipedia this is tracked, or how stable the archive cache is.
  • Archive.org has been adding all links from all languages so there is overlap and they have excellent API tools and reliability.

-- GreenC 19:34, 4 November 2016 (UTC)[reply]

"User-agent: ia_archiver
Allow: /about/privacy
Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
User-agent: ia_archiver
Disallow: /"

Pmartin (talk) 19:57, 4 November 2016 (UTC)[reply]

Pmartin, that's great you're willing to make changes. Yes, I think we would need only two things: an API to check if a URL is available, and the date - for example http://archive.wikiwix.com/cache/?url=http://example.com&date=20160901120101 .. or a date retrievable via an API. It might also be good, when archiving a page, to save the status code (also retrievable via API). Very often pages are archived with a non-200 status, but that information is lost and there's no way to determine later if the page is any good. Archive.is has this problem, making it almost useless for bots since they have a high rate of non-200 pages. Wayback tells you the original page status, which is very important for maintaining links. Also, what is Wikiwix's policy on deleting links from the cache: are they deleted automatically if the original link is deleted from Wikipedia? -- GreenC 20:14, 4 November 2016 (UTC)[reply]

A good choice for the API might be the Memento API. It is supported by many existing archive services. —RP88 (talk) 20:18, 4 November 2016 (UTC)[reply]
I'm familiar with it and use it in my bot. Unfortunately, for whatever reason, it's not always accurate; things get out of sync between what's actually in the archive database and what Memento says. This was particularly true with Webcite - emails to them went unanswered. The archive.is results are often 404 even though Memento reports 200. Most of the smaller archives like LOC and other national libraries work, but they don't usually have much content so rarely get hits. I think an API for Wikiwix could be very simple since they don't have multiple snapshots over time - a single date, page and URL. -- GreenC 20:33, 4 November 2016 (UTC)[reply]
Green, I do not understand why you need an API, if it is just a matter of updating the webarchive template to accept a link from another archive? --Pmartin (talk) 19:48, 7 November 2016 (UTC)[reply]
Here is how the template looks:
{{webarchive |url= http://archive.wikiwix.com/cache/?url=http://www.letelegramme.fr/ig/generales/regions/cotesarmor/coat-an-noz-le-chateau-retrouvera-son-eclat-01-08-2011-1386817.php}}
Produces: Archived (Date missing) at Wikiwix
Unfortunately a date is required or it gives a red error. This mirrors {{citeweb}}, which requires |archiveurl= and |archivedate=. If a date is not provided, it too will give a red error message. There is broad consensus that archives should provide the date for source verification. Wikiwix could have the date in the top grey bar, where it says "This page is a cached version of this URL," so that editors can manually type the date into the template. The API is needed if you want bots to automate adding Wikiwix links to Wikipedia, which is recommended; otherwise it won't get much usage from manual additions alone. -- GreenC 21:38, 7 November 2016 (UTC)[reply]
FYI https://fr.wikipedia.org/wiki/Discussion_utilisateur:Pmartin#I_left_you_a_message.21 we are talking about "Exclusive Solution" of IABot--Pmartin (talk) 01:16, 24 November 2017 (UTC)[reply]
Hi GreenC, I'm Johan, a technical resource at Linterweb. We added an API-like request for querying data on archive.wikiwix.com about a URL: http://archive.wikiwix.com/cache/?url=http://www.linterweb.fr&apiresponse=1 . We also added the date at the bottom of the webpage content, and a long-form URL with the datetime inside, like http://archive.wikiwix.com/cache/20180329074145/http://www.linterweb.fr . Could you perhaps help us add Wikiwix as a known archiver in Module:Webarchive? --Johan linterweb (talk) 07:53, 30 March 2018 (UTC)[reply]
@Johan linterweb: That's great news. The Webarchive Module itself shouldn't need an update (it already recognizes Wikiwix URLs), but other things need to be done:
  • Search for all instances of Wikiwix URLs on en.wikipedia and convert them to the long form, e.g. http://archive.wikiwix.com/cache/20180329074145/http://www.linterweb.fr
  • If in a citation template, add an |archivedate= (if not already present)
  • If in a webarchive template, add a |date= (if not already present) - this will fix red errors
  • Update WP:WAYBACKMEDIC so that it includes Wikiwix in its list of web archive services to search when looking for new archives. WMedic will be able to add new Wikiwix archives into Wikipedia.
  • Update WP:WAYBACKMEDIC so that it does the first three steps automatically going forward so it can maintain the system should users add Wikiwix URLs in the short form.
I'll start looking at this soon. -- GreenC 16:10, 30 March 2018 (UTC)[reply]
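The short-to-long conversion in the first step above could look something like the sketch below. It assumes the two URL shapes quoted in this thread and that the 14-digit timestamp has already been obtained (for example from the Wikiwix API, whose response format is not described here). The function name `wikiwix_long_form` is hypothetical and this is not WaybackMedic's code.

```python
SHORT_PREFIX = "http://archive.wikiwix.com/cache/?url="
LONG_PREFIX = "http://archive.wikiwix.com/cache/"


def wikiwix_long_form(short_url: str, timestamp: str) -> str:
    """Rewrite .../cache/?url=<original> as .../cache/<timestamp>/<original>.

    timestamp is assumed to be 14 digits (YYYYMMDDhhmmss), supplied by
    the caller; this sketch does not look it up.
    """
    if not (len(timestamp) == 14 and timestamp.isascii() and timestamp.isdigit()):
        raise ValueError("timestamp must be 14 ASCII digits")
    if not short_url.startswith(SHORT_PREFIX):
        raise ValueError("not a short-form Wikiwix URL")
    original = short_url[len(SHORT_PREFIX):]
    return f"{LONG_PREFIX}{timestamp}/{original}"
```

Short-form URLs carrying extra parameters (such as the apiresponse flag) are not handled; a real bot would need to strip those first.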
@Johan linterweb: - Question: is it true that links are deleted from WikiWix if they are deleted from the French Wikipedia? This would be a problem at English Wikipedia, if the wikiwix link stopped working because of removal at the French Wikipedia. -- GreenC 16:29, 30 March 2018 (UTC)[reply]
Hi GreenC, great news! About your question, the answer is no: we don't delete links from our archives even if they are removed from frwiki (because anyone can use Wikiwix to archive a web link, not necessarily an external link from frwiki). --Johan linterweb (talk) 11:43, 5 April 2018 (UTC)[reply]

Hello @Johan linterweb:, it's excellent that links are preserved. I'm working on updating WaybackMedic to add new WikiWix archives (for links marked with {{dead link}}) and the WikiWix API is returning many soft-404s. For example [1]. This will require manual checking, which means new additions would be few since the process can't be fully automated. Do you know what percentage of WikiWix archives could be soft-404? Initial tests show it might be as high as 50% (when checking the API for links on Wikipedia marked with {{dead link}}). -- GreenC 14:57, 7 April 2018 (UTC)[reply]

Hi GreenC, well, we don't know how many archives could be soft-404; we didn't mark them as 404 pages in the past, but from now on we will. --Johan linterweb (talk) 08:49, 12 April 2018 (UTC)[reply]
@Johan linterweb: I've once again blocked Wikiwix-bot on the IABot Management Interface for adding archives like http://archive.wikiwix.com/cache/20150618143943/http://business.financialpost.com/news/retail-marketing/dollarama-tests-market-in-latin-america-with-sourcing-deal to IABot's DB which are bad.—CYBERPOWER (Chat) 00:18, 9 April 2018 (UTC)[reply]
In a span of 20 minutes, the bot added roughly 200 bad archives.—CYBERPOWER (Chat) 00:21, 9 April 2018 (UTC)[reply]
Hi CYBERPOWER (Chat), as I told you on Pmartin's page, we have fixed the problem (it was due to detection of the "noarchive" meta tag). It would be simpler to write to us on Pmartin's page about problems between Wikiwix and IABot; they are not related to Wikiwix integration in the webarchive module or template. --Johan linterweb (talk) 11:58, 12 April 2018 (UTC)[reply]
  • Hi GreenC, we fixed the Wikiwix archive code to detect soft-404 URLs (using patterns, as you suggested) and have resumed pushing our archive URLs to IABot (on frwiki, but it can have side effects on other projects). We hope the detection rate is quite good. We also ran a bot internally to fix all our archive data.--Johan linterweb (talk) 08:17, 27 April 2018 (UTC)[reply]
Ok. I know from experience with archive.is that this filter method requires continual monitoring and updating, and it's not 100% perfect. What is the rate of 404s that get through the filter? I'll be able to monitor the rate when I run WaybackMedic as I do manual verification. --GreenC 14:44, 27 April 2018 (UTC)[reply]
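For readers unfamiliar with the pattern-based filtering discussed above, a minimal sketch might look like the following. The phrases in `SOFT_404_PATTERNS` are hypothetical examples; neither Wikiwix's nor WaybackMedic's actual pattern list appears in this discussion.

```python
import re

# Hypothetical phrases chosen for illustration only.
SOFT_404_PATTERNS = [
    re.compile(r"page\s+not\s+found", re.IGNORECASE),
    re.compile(r"no\s+longer\s+available", re.IGNORECASE),
    re.compile(r"\b404\b"),
]


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Guess whether a page served with status 200 is really an error page."""
    if status_code != 200:
        # A genuine error status is a hard failure, not a soft 404.
        return False
    # Error notices usually appear in the title or near the top of the page.
    head = body[:2000]
    return any(pattern.search(head) for pattern in SOFT_404_PATTERNS)
```

A filter like this can both miss error pages and flag good ones (for instance an article that merely mentions "404"), which is consistent with the point above that it needs continual monitoring.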

Module vs template

The module page still states that it is not ready for article space, but the related notice has disappeared from the template that invokes it. Which one is correct? 72.43.99.146 (talk) 02:11, 15 November 2016 (UTC)[reply]

Fixed, thanks. -- GreenC 05:24, 15 November 2016 (UTC)[reply]

Portuguese Web Archive

Portuguese Web Archive is misspelled, i.e., "Archived August 1, 2016, at the Portuguese Web Archive". 79.76.183.103 (talk) 20:32, 20 January 2017 (UTC)[reply]

Fixed. -- GreenC 05:26, 21 January 2017 (UTC)[reply]

|url= vs. |archiveurl=

It's very annoying that |url= is actually the generic web archive URL, not the deadlink. It's not possible to specify a specific snapshot (like is done with the |archiveurl= parameter in {{cite}}) using this template. This was proposed by many people in the TfD discussion, but no one commented on, or addressed it. Why does this parameter exist then?

Then |date= is actually the date the link died, not the date of the snapshot. This is inconsistent. |url= is the original URL, which is automatically linked via The Wayback Machine, so one would assume if the specific Wayback URL can't be used that the date parameter is for pointing automatically at the snapshot in question, but it's not, and it deals with the original URL's date. Hexafluoride Ping me if you need help, or post on my talk 18:15, 10 February 2017 (UTC)[reply]

|date= is the date of the snapshot. |url= is the archive URL not the dead url. There is no need for the dead url it's not a CS1|2 or citation template. The documentation page has examples how the template works. -- GreenC 19:04, 10 February 2017 (UTC)[reply]
Note that the parameter naming issue was definitely addressed/discussed by several users at TfD – have another read of Wikipedia:Templates_for_discussion/Log/2016_October_24#Web_archive_templates – for example, note that RCraig09 even struck that part of their !vote following GreenC's response. If you want to propose changing parameter names, or adding parameter aliases, or whatever, you can make such a proposal, but the "no one commented on, or addressed it" claim is simply not accurate. - Evad37 [talk] 19:35, 10 February 2017 (UTC)[reply]
I'm very sorry. I've got things mixed up. This should be at {{deadurl}} not here. This template works as intended. I was reading the documentation for {{deadurl}}, then read the TfD discussion and somehow the two wires crossed in my brain. —Hexafluoride Ping me if you need help, or post on my talk 19:38, 10 February 2017 (UTC)[reply]

Ranges for archive index?

Sometimes official sites change multiple times, so I'd like to see the archive index (for wayback machine citations) show a range of when a page was archived. For example one URL would be 1998-2001, another 2001-2005, and so on WhisperToMe (talk) 04:17, 9 July 2017 (UTC)[reply]

Not sure if Internet Archive supports ranges, how would that URL look? It is possible to link to the index with '*':
{{webarchive |url=https://web.archive.org/*/http://en.wikipedia.org |date=* |title=Enwiki}}
Enwiki at the Wayback Machine (archive index)
-- GreenC 13:51, 9 July 2017 (UTC)[reply]

Error category with no error text

László Nagy (canoeist) shows the "Webarchive template warnings" category, but I do not see any red error text, so it is difficult to figure out what to fix. – Jonesey95 (talk) 18:45, 1 December 2017 (UTC)[reply]

There is no |date=, |date2=, etc. A date argument is required for each matching |url=, |url2=, etc. The template design decision was to let a missing date pass silently, without a red warning, since the date can (usually) be recovered from the URL; the page is still added to the warning category. -- GreenC 21:36, 1 December 2017 (UTC)[reply]
Jonesey95, my bot WP:WAYBACKMEDIC was able to fix most of these (missing dates). -- GreenC 19:17, 11 March 2018 (UTC)[reply]
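As a rough illustration of how a date can be recovered from an archive URL, the sketch below looks for a 14-digit timestamp path segment, as seen in the Wayback and long-form Wikiwix URLs on this page. The name `date_from_archive_url` is hypothetical and this is not the module's actual logic; URLs with shorter or differently placed timestamps are simply not recognized here.

```python
import re
from datetime import datetime

# A 14-digit YYYYMMDDhhmmss path segment between slashes.
TIMESTAMP_RE = re.compile(r"/(\d{14})/")


def date_from_archive_url(url: str):
    """Return the snapshot date embedded in url, or None if none is found."""
    match = TIMESTAMP_RE.search(url)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S").date()
    except ValueError:
        # Fourteen digits that do not form a valid date and time.
        return None
```

An index link such as https://web.archive.org/*/http://en.wikipedia.org carries no timestamp, so the function returns None for it.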

Batch referencing books in the public domain

Archive.org has become a de facto library for books in the public domain, and it's widely referenced throughout Wikipedia. It'd be great to have a tool/bot that goes through an article, and adds bibliographical references to all publications cited in the text and wikilinked to a publication article (book, etc), if there isn't one already. I wouldn't know how to code it, but the algorithm seems easy enough to implement:

  1. Go through all wikilinks
  2. If it's linked to a publication article, search archive.org using the query:
    https://archive.org/search.php?query=title%3A%28'title'%29+AND+creator%3A%28'author'%29+AND+mediatype%3A%28texts%29&sort=-downloads
  3. Pick first result, and, if a reference is not already there with the same author and title, add it in.

Easy enough? — WisdomTooth3 (talk) 23:53, 2 March 2018 (UTC)[reply]
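Step 2 of the algorithm above, building the search URL, might be sketched as follows. It reproduces the query string quoted in the proposal, treating the quoted 'title' and 'author' as placeholders; the function name `archive_org_search_url` is hypothetical, and nothing here fetches or parses results.

```python
from urllib.parse import urlencode


def archive_org_search_url(title: str, author: str) -> str:
    """Build the archive.org search URL for a title/author pair, most-downloaded first."""
    query = f"title:({title}) AND creator:({author}) AND mediatype:(texts)"
    # urlencode percent-encodes ":", "(" and ")" and turns spaces into "+".
    return "https://archive.org/search.php?" + urlencode(
        {"query": query, "sort": "-downloads"}
    )
```

Picking the first result and checking for an existing reference (step 3) would still need matching on author and title, which is where most of the real difficulty is likely to lie.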

Idea: Date ranges?

For archive indexes I'd like the option to display a range of dates. Many organizations change official website URLs multiple times, so I'd like to have a range of official site URLs with different dates according to when that URL was maintained.

Thanks! WhisperToMe (talk) 22:00, 22 April 2018 (UTC)[reply]

Previous. -- GreenC 22:29, 22 April 2018 (UTC)[reply]
Also there is the |addlarchives= option that will allow for multiple snapshots. -- GreenC 23:09, 22 April 2018 (UTC)[reply]

Preview error please?

Would it be possible to make this template cause an error or warning in Preview if an editor has put it inside <ref> tags? -- 109.78.242.41 (talk) 13:54, 22 July 2018 (UTC)[reply]