Wikipedia talk:Featured article candidates/Citation templates (technical)


The technical explanations from Wikipedia talk:Featured article candidates/Citation templates are copied here to make them easier to find.

Israel is slow mainly because of citation templates (technical points only)

The slow page-load time SlimVirgin observes in Israel is not a figment of imagination: it's real. I can easily reproduce the problem by visiting version CITE of that article (which was current as of about 15 minutes ago), clicking on the "edit this page" button, and then pressing "Show preview". The resulting page has a little comment in its HTML that says "Served by srv178 in 32.837 secs." (your timings may differ if you talk to a faster or slower server), and the entire process took about 36 seconds real time. The bottleneck is on the Wikipedia server side, not in my network or browser. If I then visit version VCITE, which is identical to version CITE except that it uses {{vcite journal}} etc. rather than {{cite journal}} etc., and do exactly the same thing, the resulting page's little HTML comment says "Served by srv227 in 12.298 secs." and the overall process takes about 15 seconds real time (again, your timings can vary quite a bit). In my experience, switching from the {{cite journal}} to the {{vcite journal}} family improves the editing experience (i.e., the real time it takes to load a page) by a factor of two. The only thing that's changed is which citation templates are used: I'm not changing images or anything else. (The test with Israel isn't quite fair: not all the vcite flavors are implemented yet, so I fall back on the cite versions, and I approximate citation with vcite book. But it's close enough.)

This demonstration disproves the above claim "The argument being put forward – that citation templates slow down load times – is patently false, or at best specious." The citation templates are the main bottleneck in Israel, and it's their execution on the server side that is killing overall performance when editing.

Part of the confusion in the above discussion comes because some editors are visiting copies of the page that are cached on the server side. These copies are generated quickly, avoiding the bottleneck. But when you're logged in, and particularly when you're editing, the cache doesn't work.

One other thing, independent of caching: version CITE's HTML is 42% larger than version VCITE's HTML, for no particularly good reason. This difference (610 kB versus 431 kB) is not a big deal if you have a fast network connection, but most of the Internet still uses slow (non-broadband) connections, where the difference bites hard. People in (say) Africa routinely shut off images to save money (they are typically downloading over cell phones), and I expect that they routinely avoid large Wikipedia pages like Israel simply because even with images turned off it's too expensive to load. This is a real and important problem with our encyclopedia. Eubulides (talk) 22:15, 5 February 2010 (UTC)
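The measurement described above can be repeated mechanically. The following is a minimal Python sketch, assuming the trailer comment always has the form quoted in this thread ("Served by srvNNN in N.NNN secs."); the function name `server_time` is hypothetical, and the sizes are the rounded kB figures given above.

```python
import re

# Pattern for the trailer comment quoted in the discussion, e.g.
# "<!-- Served by srv241 in 21.967 secs. -->". The format is taken from
# the quotes in this thread; it is an assumption anywhere else.
SERVED_BY = re.compile(r"Served by (\S+) in ([0-9.]+) secs\.")

def server_time(html):
    """Return (server_name, seconds) from the last 'Served by' comment, or None."""
    matches = SERVED_BY.findall(html)
    if not matches:
        return None
    name, secs = matches[-1]
    return name, float(secs)

# The two comments quoted for version CITE and version VCITE.
cite = server_time("<!-- Served by srv178 in 32.837 secs. -->")
vcite = server_time("<!-- Served by srv227 in 12.298 secs. -->")
speedup = cite[1] / vcite[1]
print(f"server time ratio CITE/VCITE: {speedup:.2f}x")

# HTML sizes quoted above, in kB (rounded figures from the comment).
cite_kb, vcite_kb = 610, 431
size_increase = (cite_kb - vcite_kb) / vcite_kb
print(f"CITE HTML is about {size_increase:.0%} larger than VCITE HTML")
```

Run against the saved source of a freshly previewed page, `server_time` separates time spent on the server from time spent on the network or in the browser, which is the distinction the comment above relies on.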

Nobody has disputed that the Israel page is slow to load if it's not been cached; the discussion centres around why, and it's self-evident from the analyses that have been carried out that it's almost entirely due to the number of html objects in the page. No doubt a few microseconds could be shaved off by removing all of the templates, but so what? ... Can you provide links to your test pages so that I can see for myself what it is that you're claiming? --Malleus Fatuorum 22:30, 5 February 2010 (UTC)
"No doubt a few microseconds could be shaved off" No, that's not what my benchmarks showed. Merely switching to faster citation templates shrank the time from 36 seconds down to 15 seconds, which demonstrates that the citation templates are the major bottleneck in Israel. And redoing the citations entirely by hand, as SV proposes, would shrink the time even further (I expect down to the 5–10 second range). Furthermore, my benchmarks showed a bloat of about 40% in the generated HTML, purely because of {{cite journal}} etc. as compared to better templates or doing it by hand. These improvements are not just "microseconds": they're major differences. The currently-popular citation templates have significant adverse effects on both editors and readers. I don't blame SV at all for not wanting to edit Israel: it's a terrible user interface if you have to wait 30 seconds after you press "Save page" to go on working. Eubulides (talk) 22:44, 5 February 2010 (UTC) PS. The links to the test pages are in my previous comment, but here they are again: version CITE for the current version, and version VCITE for the same page using faster citation templates. Eubulides (talk) 22:46, 5 February 2010 (UTC)
I tried these out, and Eubulides is right. {{vcite journal}} etc. is significantly faster than {{cite journal}} etc. —mattisse (Talk) 23:35, 5 February 2010 (UTC)
Eubulides may be right that they download faster, but for the wrong reasons. I'd like to look at this myself, because I'm certain there's something else going on here, and it's got very little if anything to do with citation templates. --Malleus Fatuorum 23:39, 5 February 2010 (UTC)
There's an overriding principle in networked systems, which is to ensure that communication doesn't overwhelm computation. Computation is cheap and fast, communication is expensive and slow. I think we need to look more deeply at why the Israel page is so slow to load, and to focus on the slow communication issue – 91 (necessarily slow) http connections for one page is simply ridiculous. If it turns out that it's the templates that are generating those 91 html objects then that's the problem that needs to be solved, but right now it's by no means obvious what the problem is. So to call for any change in the FA criteria to address a problem not yet fully understood is to say the least premature. --Malleus Fatuorum 00:28, 6 February 2010 (UTC)
  • The symptoms are entirely those of server-side slowdown. The slow-page-generation problem has nothing to do with communication between browsers and Wikimedia servers. It is true that the citation templates generate significant HTML bloat, but that is of concern only to users on slow (non-broadband) connections.
  • It's not just Israel. It's every page that uses the standard citation templates. The more templates a page uses, the slower it is generated. Once you have more than a dozen or two citations the slowdown is noticeable. Once you have a hundred or so it gets reeeeally irritating. Once you have two hundred it is so painful to edit that many editors give up. (These numbers are approximations; it partly depends on how complicated the citations are.)
  • You can try the same example with just the citations, by visiting my CITEONLY sandbox, which contains (as best I can arrange it) just the citations from Israel, and my VCITEONLY sandbox, which contains the same citations using {{vcite book}} etc. In both cases, click on "edit this page" and "show preview". In my measurements, CITEONLY took about 22 seconds of server time, and VCITEONLY took 5 to 6 seconds. According to websiteoptimize.com, CITEONLY and VCITEONLY required exactly the same number (38) of HTTP requests, and the number and type of external objects was exactly the same. The only difference that websiteoptimize.com reported was that VCITEONLY generated 21,259 bytes of HTML, whereas CITEONLY generated 41,821 bytes, that is, about twice as much HTML; this is the HTML bloat alluded to two bullets above.
Eubulides (talk) 06:44, 6 February 2010 (UTC)
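The sandbox figures in the last bullet can be set side by side. This is a small Python sketch using only the numbers quoted there; the 5.5 s value is an assumed midpoint of the stated "5 to 6 seconds", and the dictionary layout and the name `ratios` are illustrative.

```python
# Figures quoted above for the two citation-only sandboxes.
# 5.5 s is an assumed midpoint of the reported "5 to 6 seconds".
sandboxes = {
    "CITEONLY": {"server_secs": 22.0, "html_bytes": 41821, "http_requests": 38},
    "VCITEONLY": {"server_secs": 5.5, "html_bytes": 21259, "http_requests": 38},
}

def ratios(slow, fast):
    """Return slow/fast for every metric the two measurements share."""
    return {key: slow[key] / fast[key] for key in slow}

r = ratios(sandboxes["CITEONLY"], sandboxes["VCITEONLY"])
for metric, value in r.items():
    print(f"{metric}: {value:.2f}x")
```

The request count comes out at a ratio of exactly 1, while server time and HTML size do not. That is the basis for the argument that the number of objects on the page does not explain the difference between the two sandboxes.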
If the vcite templates are so much better, do they have a downside? otherwise, why don't we just erase the cite templates and move the vcite to the cite ones? Casliber (talk · contribs) 08:29, 6 February 2010 (UTC)
The vcite templates aren't yet complete: they cover the Big Four ({{vcite book}}, {{vcite journal}}, {{vcite news}}, {{vcite web}}) but not the other cite XXX variants. Also, these templates use Vancouver style rather than the quasi-APA style used by {{cite book}} etc. and not everybody likes this style even if it is a bit more efficient. One quick fix that could be done without worrying about style issues is to change the default in {{cite book}} etc. to not generate COinS data and to trim away some of the relatively-useless links it generates: this would be a significant savings in HTML bloat and server CPU time, though it still wouldn't make them as fast as the vcite family. Eubulides (talk) 08:40, 6 February 2010 (UTC)
Citation templates do not cause "slow load times throughout Wikipedia". They cause slow load times only when the page is not already cached. They're a problem mostly for editors, not for IP readers. Eubulides (talk) 22:20, 6 February 2010 (UTC)
These problems are definitely not "unusual" or "unique". They occur whenever the server needs to generate a page full of citation templates, which, when you're editing a page, is always. The problems are easily reproducible; just follow the steps I listed above. Eubulides (talk) 22:20, 6 February 2010 (UTC)

Random break

The question is not a difficult one to answer, but nobody is listening. There are two issues with templates: processing time on the server and potential html bloat resulting in larger html pages. Neither of these is the reason why the Israel page is slow to load; that's quite simply down to the number of objects on the page. --Malleus Fatuorum 16:52, 6 February 2010 (UTC)
  • Nobody is listening because Malleus's analysis is incorrect for how editors use the page. One more data point: I just now visited Israel as a logged in editor, and because the page wasn't cached and needed to be recomputed, it took about 25 seconds to load. The HTML ended with the line "<!-- Served by srv241 in 21.967 secs. --></body></html>", showing that the vast majority of this time was spent processing the page on the server. Whether this is due to the "number of objects on the page" or some other factor is irrelevant: what counts is that the page is slow to load, and (as the experiments above show) this is almost entirely due to the use of standard citation templates on the page.
  • "Somewhere on wiki there must be a knowledgable computer person who simply knows the answer to the question do cite templates make articles slower to download" I doubt whether there is any such person. The Mediawiki source code is messy in this area: it's obviously slow, but is contorted and will be hard to tune. I expect that the Mediawiki developers and Wikipedia maintainers would rather not think about the problem. However, there are a few signs of life: please see #Mediawiki-based solution below.
Eubulides (talk) 22:20, 6 February 2010 (UTC)

A long time ago I compared the load times of some articles with cite templates to the same articles with the citations written without any templates. The difference in load times was not significant, in my opinion. I just did the same for the page in question here. Server load times were .054 to .158s for both pages, but without a lot of sampling I couldn't say there was any difference. Actual load and render time after a cleared cache on the particular computer I was using at the time: 4.5s to 5.0s for either page. The no-cite version may perhaps render .3s faster. Maybe. If the no-cite version is actually faster, some of that difference would be due to the no-cite version having about 8% fewer characters of wiki text (or about 4% fewer characters of html); the no-cite version also has somewhat fewer html tags, which might provide a few microseconds advantage, too. If anyone wants to check these tests, see User:Gimmetrow/test and User:Gimmetrow/test2. diff. I tried to replace the cite templates with equivalent text, rather than use the short refs mentioned by SV. Gimmetrow 17:01, 6 February 2010 (UTC)

This analysis is using cached times, for IP readers. But that's not the problem. The problem is when editors are logged in: often the pages aren't cached for them, if they have preferences set or whatever, so they need to be recomputed, even if they're just reading the page. And when they're editing the page, of course the page needs to be recomputed. Please redo the analysis assuming that the pages have not been computed and are not cached. Eubulides (talk) 22:20, 6 February 2010 (UTC)
I don't see much significant difference in load-time to a reader, but I wouldn't be surprised if preview takes longer to load with templates. Are you making an argument just for editors? Wouldn't any slowdown be mitigated by editing the page in small sections, as other factors tend to encourage anyway? Gimmetrow 17:45, 6 February 2010 (UTC)
Of course editing the page in sections will improve page-generation times, since you needn't wait for the whole page to be recomputed. However, one cannot always do that (some changes require access to the whole page, or you're editing the citation itself); besides, when one hits the "Save page" button, one has to wait for the 30 seconds or more to get the page regenerated, and this wait is seriously impeding editing. Eubulides (talk) 22:20, 6 February 2010 (UTC)

Rendering slowness / templates

Here is what I know about it technically:

  • Templates by themselves do not make loading an article slower. The templates are expanded by the servers when the page is saved, not when it is viewed.
  • The normal citation templates generate COinS metadata which makes the html sent to viewers quite a lot longer and more complex. The vcite templates by Eubulides do not, and I suspect this is why they display faster. I do not know what COinS is really useful for.
  • If you want to make Israel and other country articles render faster, I suggest taking a look at the navigation boxes hidden under "Articles Related to Israel" at the bottom. That is a lot of html to transmit and render for every page view, and how useful is it really?

--Apoc2400 (talk) 19:08, 6 February 2010 (UTC)

(This started off as a targeted response then overwhelmed me, so I'm just dumping it here as a general comment)
You guys are talking about several different things within the same breaths:
  • One issue is how long it takes for a reader to see an article. This depends mostly on the size of the page that is actually rendered in HTML after the wiki-software gets through with it; that page goes to a "squid" cache so it can be served up quickly for the next reader. In this scenario, the number of "objects" your browser downloads is not especially material, though there is some extra transmission overhead for each object. The COinS HTML overhead will obviously affect transmission time; if you "view source" of pages, you can figure out how much HTML data has been squeezed into the intertube.
  • Recently I believe they've moved the wikibits images to a dedicated server, which is incredibly annoying ATM because it seems that server doesn't close connections properly and it takes forever for the page to actually finish. Keep in mind though that your own browser also caches objects. Theoretically the wiki-globe logo only ever got downloaded once to your system. The time reported within the HTML source that Eubulides (I think) mentions as "server time" - that I think is the best indication of time-on-server, as opposed to time-to-average-reader once the page is squid-cached.
  • If you the editor have some different setting from the default for things like thumbnail size, skin &c., you will bust the squid cache and the HTML has to be rendered again from the parser cache (which has the representation of the article itself sans menu wrappers). This will take some time, although the squids will cache the pageview for the next reader with your exact same preferences.
  • Now if you're actually trying to edit the page, more factors come into play. When you edit (or purge) a page, you invalidate every single cache (in preview mode you're just pretending but the same stuff happens, just for you), it all starts again at the parser:
    • Sheer size of the wiki-text itself will affect load time, especially for dialup connections.
    • Template expansion. Anytime you put something within {{squigglies}}, you are telling the MW software to swap in the contents of the template page with appropriate parameters AND THEN SCAN THE PAGE AGAIN to see if one of the new parameters contains a template, which then gets expanded the same way. This is a recursive (re-entrant) function from my reading of the software and works at the page level - so the depth of template calls can become significant. Within the HTML source of a page you should be able to find something like "NewPP pre-expand size" and post-expand size (and the related "expensive parser functions"), these give a partial clue of how onerous use of templates is. This has nothing at all to do with your connection speed.
    • Database hits. A basic limiting factor in rendering a wikipage from scratch is how many times an actual disk-page must be retrieved. From that standpoint, when you use the {{convert}} template the first time, the servers have to get the actual copy of that file from the hard drive, then go through the recursive thing I talked about above. In the case of {{convert}} I know for certain sure that the server will have to also fetch at least one sub-page of the main template. But DB / parser-servers cache pages too, if my preview task happens to get assigned to a server where you've just referenced the same objects in the database, we'll likely be able to get that good 'ol edit conflict. Again, not connected to how fast your connection or computer is.
Bottom line: for lag time from the editor standpoint, I think this is an interaction between page size, number of templates (which seem to force the entire page to be re-parsed), and depth of template calls (which force a re-parse, only recursively). I think I saw Z-man commenting somewhere, he would be well-armed to blow me out of the water here. But that's my impression of how the whole thing works... Franamax (talk) 02:28, 7 February 2010 (UTC)
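The "expand, then scan the page again" behaviour described above can be illustrated with a toy model. The Python sketch below is not MediaWiki's actual preprocessor: the template store, the `$1`/`$2` placeholder syntax, and the names `expand_once` and `expand` are all invented for illustration. It only shows how nested template calls force additional passes over the text.

```python
import re

# Hypothetical template store. Real templates live on wiki pages and use
# {{{1}}}-style parameters; "$1" is a simplified stand-in for this sketch.
TEMPLATES = {
    "cite": "{{author|$1}}. {{italic|$2}}.",
    "author": "$1",
    "italic": "<i>$1</i>",
}

# Matches a {{name|arg|...}} call that contains no further braces.
INNERMOST = re.compile(r"\{\{([^{}]*)\}\}")

def expand_once(text):
    """Replace every innermost template call with its filled-in body."""
    def substitute(match):
        name, *args = match.group(1).split("|")
        body = TEMPLATES[name.strip()]
        for i, arg in enumerate(args, start=1):
            body = body.replace(f"${i}", arg)
        return body
    return INNERMOST.sub(substitute, text)

def expand(text, max_passes=40):
    """Rescan the whole text until no template calls remain; count the passes."""
    passes = 0
    while INNERMOST.search(text):
        if passes == max_passes:
            raise RuntimeError("template expansion did not terminate")
        text = expand_once(text)
        passes += 1
    return text, passes

result, passes = expand("{{cite|Smith|A Book}}")
print(result, passes)
```

In this toy model a template whose body calls other templates needs one extra pass per level of nesting, so the work grows with both the number of calls on the page and the depth of each call, in line with the interaction suggested in the bottom line above.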

[no ec, but intended to post below MF 23:01] I've found this discussion interesting because of the variety of factors involved and how people tend to talk past each other. My precis: Eubulides has presented simple-to-verify evidence by way of his or her links that one citation template leads to a slower page-loading time than the other, when the page has to be "re-rendered" by the server. A page has to be "re-rendered" when it is saved, to some extent when it is previewed (editing by section ameliorates this), and from my experience with Eubulides' links, when an old diff is accessed (possibly, again, only as a logged-on user). A reduction in HTML page size using new templates has also been demonstrated, and while this is of course beneficial, differences in HTML size have little to do with the slow-editing problem that started this discussion—the issue is server-side. Yes, we are talking about an issue that affects editors, not readers (smaller HTML pages would actually benefit the average IP reader more). SlimVirgin makes a case that long pages with many citations are hard to edit; looking back, I agree, and now I understand why they were. There is evidence that changing how a citation template is "programmed" significantly improves an editor's wait times when previewing and saving heavily cited pages. In ignoring the evidence presented while speaking of intuition and others' "fixations", Malleus appears as the one with the fixation.

All things being equal, and without regard to what citation templates are used, why not improve the citation template "programming" to improve page rendering? Eubulides has already demonstrated that it can be done. I don't like the Vancouver system—finding it pointlessly terse and unformatted when space is not a consideration here—but whatever makes the vcite template faster to render—can it not be ported to the citation templates already in use? Whether that means removing parameters that add to programmatic and editorial complexity, removing "coins" metadata, or other technical differences between "cite" and "vcite" that haven't been determined yet, or mentioned here, it seems like a way forward. (I certainly carry a bias that "debulking" Wikipedia is never a bad thing—procedurally, technically, etc. Almost everyone has a pet project that bulks Wikipedia up in areas where we don't need it, and not in content, where we need it. That's the slow death of any organization.) Outrigger (talk) 02:37, 7 February 2010 (UTC)

That precis is a good one; thanks. The standard citation templates could be sped up quite a bit as follows:
  • Drop COinS. Hardly anybody uses it, and the performance penalty hurts everybody (particularly editors).
  • Drop |first1=, |last1=, etc. Just use |author=.
  • Drop the unnecessary and repetitive wikilinks to ISBN, DOI, etc.
The result still wouldn't be as fast as {{vcite book}} etc., but it'd be a good sight faster than what's there now, and the resulting HTML would be within shouting distance of the size generated by {{vcite book}}. Eubulides (talk) 02:49, 7 February 2010 (UTC)
I don't particularly care about other article editors, so long as you see slower performance than me. :) Seriously though, I'd be more concerned with excluding dialup editors than the general problem of an article that totally pooches out no matter who tries to edit it. Eub, you've done some tests that seem to demonstrate the case empirically and quite well. Using {{vcite}} rather than {{cite}} seems to work well. However you also propose a concomitant loss of functionality, and that is what should be examined. If microformats should be discarded, shouldn't that be a standalone proposal? I recall jumping on someone once for breaking microformat data, so I'd be interested in the outcome. Dropping extra parameters saves you time on the first pass through the loop; I think you or I could verifiably demonstrate a savings of several milliseconds but not likely more. Repetitive wikilinks will take more time "on-the-wire", let's say an extra 200 bytes per link, times 200 links - that's about ten seconds on dialup so it becomes significant. Whoever suggested going to ask the serious pros above: that might be a good idea. I do believe though that it is not so much about using a template repeatedly as it is how complex the template is in terms of its expansion (remember that all a template is, is a piece of text that I keep reading over and over again, writing in more stuff, until it doesn't have squigglies anymore, then I parse it as wiki-text). I suppose now I have to actually do that analysis rather than yapping about it. :( Performance gains generally come from either faster computers and networks (we wish); divesting functionality (always easy, except for those who relied on it); or innovations in the code which preserve functions (in which case, pls demonstrate). I'll note that while I'm interested in this discussion, I don't want to get in the way of you guys pursuing your solution, so please only include me if I'm actually helping.
Franamax (talk) 03:45, 7 February 2010 (UTC)
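The dial-up estimate above (200 extra bytes per link, times 200 links) can be checked with a few lines of Python. The modem rates below are assumed nominal line speeds, not figures from the discussion, and the sketch ignores compression and protocol overhead.

```python
def transfer_seconds(num_bytes, bits_per_second):
    """Time to move num_bytes over a link of the given raw bit rate."""
    return num_bytes * 8 / bits_per_second

# Estimate from the comment above: 200 extra bytes per link, 200 links.
extra_bytes = 200 * 200

# Assumed nominal dial-up rates in bits per second (illustrative only).
for rate in (28_800, 33_600, 56_000):
    secs = transfer_seconds(extra_bytes, rate)
    print(f"{rate} bit/s: {secs:.1f} s for {extra_bytes} bytes")
```

Under these assumptions the extra 40,000 bytes cost somewhere between roughly six and eleven seconds, so "about ten seconds" is a fair order-of-magnitude figure for slower modem connections.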

Subpage ?

The contention has been verified by taking one slow-to-load page (namely Israel), replacing {{cite book}} etc. with {{vcite book}} etc., and noting that this decreased page-load times by more than a factor of two; see version CITE vs. version VCITE. Wikipedia:Centralized discussion/Wikipedia Citation Style #Demo of specific proposal reports a benchmark in which a page with standard citation templates takes 6.6× longer to generate than a page with citations formatted by hand (and 2.4× longer than a page with {{vcite book}} etc.). There really is no doubt that most of the slowness is due to the citation templates. If you're not yet convinced, I can easily do another test based on Canada. Eubulides (talk) 23:36, 6 February 2010 (UTC)