Talk:Usage share of operating systems

This is an old revision of this page, as edited by Wikiolap (talk | contribs) at 17:59, 13 November 2011 (Undue weight: Looks OK to me). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

WikiProject Technology (Unassessed)
This article is within the scope of WikiProject Technology, a collaborative effort to improve the coverage of technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on Wikipedia's content assessment scale.
WikiProject Computing (Unassessed)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on Wikipedia's content assessment scale.
This article has not yet received a rating on the project's importance scale.

Linux share: Caitlyn Martin's blog piece

Is this [1] a credible secondary source? It seems to me to be an exercise in wishful thinking. It seems to be clutching at straws. As a Linux enthusiast myself I've tried to follow her argument but it doesn't stack up IMHO. She says "The best estimate for present sales is around 8%", but she doesn't cite a source for this estimate, and in any case, present sales is a very different thing from the total installed base, bought over several years, that makes up usage share.

I can quite accept that the web client stats under-measure Linux a bit, mainly because Linux users are relatively security and privacy conscious and thus more likely to disable javascript, install adblock etc., all things which reduce counting on the third-party stats sites. It's interesting that Wikimedia's figures, based on server log files and thus immune to this hazard, show a somewhat higher figure (1.57%) than most of the others.--Harumphy (talk) 14:00, 4 December 2010 (UTC)[reply]

Agreed, but still not 8%. I tried to work that 8% figure in somewhere too, but the jump from 'installed user-base' to 'current sales' seemed too sharp for a short addition to existing text. The only way would be to devote a whole couple of sentences to it somewhere, and I'm not sure if she is notable enough for that. O'Reilly is a good source, but I'm not sure of her status to be speaking for them. --Nigelj (talk) 15:10, 4 December 2010 (UTC)[reply]
The only cited source there is a quote from Steve Ballmer, in which he says that internal Microsoft research showed Linux and MacOS shares to be comparable. We already have this source covered. The blog doesn't seem notable enough to include in the article. Wikiolap (talk) 20:03, 4 December 2010 (UTC)[reply]

Should we remove the Wikimedia web client statistics?

The article currently states: "All of these sources monitor a substantial number of web sites. Statistics that relate to a single web site are excluded." To a large extent, this is not true for Wikimedia, of which Wikipedia alone is by far their most trafficked web site (although one that most English-language Web users have visited).

Also note that the Wikimedia report is based on the total number of HTTP requests rather than the number of unique clients (as determined using cookies). We need to consider the merits of the two approaches and which is more accurate. The Wikimedia report could easily be biased toward those operating systems used by those who access Wikipedia more often (although the others could be influenced by how much of each browser's user base regularly clears cookies). On these two principles, should we exclude the Wikimedia statistics? PleaseStand (talk) 00:48, 14 December 2010 (UTC)[reply]

Wikimedia's stats cover 60-odd sites within the Wikimedia family.[2] While this is much less 'substantial' than many of the other sources, it's much greater than 'one', the avoidance of which (specifically w3schools) was the original purpose of that sentence. (From time to time we get people trying to add w3schools' stats to the table, or suggesting that we should on this discussion page. Often they seem to be unaware that that site's stats are for its own site only, and that that site is aimed at web developers - a highly atypical readership with a much more diverse set of web clients than the general web-using population.) Also, the Wikipedias are very high-traffic sites. The English one disproportionately so, granted, but there are similar regional/linguistic skews in many of the other stats sources too. So I wouldn't exclude Wikimedia stats on the grounds that they monitor an insubstantial number of sites.
AFAIK there's no evidence to suggest that certain operating systems are used by those who access WP more often. Just as there's no evidence that certain OS's are used more by those who clear cookies, block scripts, use adblock etc. I imagine many of us have our suspicions in this regard, but no actual evidence. And if we had such evidence, the magnitude of the biases they introduce may be no larger than many of the other biases we already know about and to which all the sources are prone. So I don't think there's a case for excluding Wikimedia stats here, either. Harumphy (talk) 11:05, 14 December 2010 (UTC)[reply]

 Northern Ontario Jacob12190 (talk) 11:06, 14 December 2010 (UTC)[reply]

Web client table tweaks - January 2011

I propose to make a couple of minor tweaks to the table when the December figures come out in the new year, unless there are objections here first:

  1. For the Clicky desktop/mobile 'in lieu' split, take the mean of the Net Market Share and Statcounter figures instead of just using Statcounter. (This will probably have the effect of reducing Clicky's mobile share from around 4.1% to around 3.6%.) The footnote will explain what has been done.
  2. Android is rising rapidly and within a few months may overtake what we currently call 'mainstream' Linux. I propose to change the "mainstream" sub-heading to "desktop distros.". Harumphy (talk) 14:56, 29 December 2010 (UTC)[reply]
2) Oppose. There are several other mobile Linux distributions such as Maemo currently included within Mainstream Linux.1exec1 (talk) 00:24, 31 December 2010 (UTC)[reply]
Fair enough. I've done #1 but not #2 in today's update.Harumphy (talk) 11:09, 1 January 2011 (UTC)[reply]
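
As a rough sketch of tweak #1 above, the 'in lieu' mobile share for Clicky can be taken as the mean of the two reference sources. The figures are assumptions: 4.1% echoes the approximate Statcounter-only value mentioned above, while 3.1% for Net Market Share is hypothetical, picked only so that the mean lands near the 3.6% mentioned above. Treating desktop as the remainder is also a simplifying assumption.

```python
def in_lieu_mobile_share(reference_shares):
    """Mean of the mobile shares (in percent) reported by the reference sources."""
    return sum(reference_shares) / len(reference_shares)

# Assumed figures, not taken from any source's published data.
statcounter_mobile = 4.1
net_market_share_mobile = 3.1  # hypothetical

mobile = in_lieu_mobile_share([statcounter_mobile, net_market_share_mobile])
desktop = 100.0 - mobile  # simplifying assumption: everything else is desktop
print(f"mobile {mobile:.1f}%, desktop {desktop:.1f}%")
```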

Should we remove AT Internet Institute from web client stats?

We're seeing constant changes in the data, month by month (summary of each month here). With lack of frequent updates by ATII, we're ending up with less accurate results... 195.23.92.1 (talk) 18:18, 7 January 2011 (UTC)[reply]

Support 1exec1 (talk) 00:55, 8 January 2011 (UTC)[reply]
Be consistent - if we are to remove sources which don't update every month - then remove all of them, i.e. including Wikipedia one. If Wikipedia stays then ATII should also stay. Wikiolap (talk) 17:12, 8 January 2011 (UTC)[reply]
Remove - ATII is consistently slow at updating their stats. Remove for January's stats, unless they update. As for Wikimedia, we have a little more control over it. I've emailed the person updating the stats in the past, I think we just need to have him set up something more automated, because I think he has to manually run his scripts. Or talk to someone else that has access to the logs and can give them to us. Jdm64 (talk) 18:57, 8 January 2011 (UTC)[reply]
Keep. Last time we discussed this (see archive) we decided to keep stats for up to 12 months. It used to say as much in the web client section. (Somebody, unaware of the discussion and seeing that all the stats at the time were more recent, changed the "12 months" to "few months". I've just changed that back.) If 12 months is too long a period, then we should reduce that period. Whatever we do, we should apply the same time limit to all the sources.Harumphy (talk) 19:21, 8 January 2011 (UTC)[reply]
I, the guy who raised the issue in the first place, agree with this item, then. I didn't know, and couldn't find anywhere, that the discussion was held in the past and that "12 months" was decided. Since it was, let's just abide by the decision. 195.23.92.1 (talk) 14:09, 10 January 2011 (UTC)[reply]
Hmm, I haven't found any discussion in the archives. The problem is that the software market evolves very rapidly. Most of the time the initial adoption of a product increases exponentially, so by allowing a 12-month delay we face data errors of more than 10% [3]. For example, AT Institute data now differs from the median/mean data in the Windows columns by huge margins (W7 - ~7%, Vista - ~4%, XP - ~11%). If we reduce the allowed delay to, say, 6 months or less, we can cut the possible data errors by more than half. 1exec1 (talk) 12:10, 21 January 2011 (UTC)[reply]
The 12 month precedent came up in Talk:Usage_share_of_operating_systems/Archive_1#web_clients_summary_table, in a brief discussion about what to do with OneStat data from Dec 8, 2008 that was getting on for one year old at the time. Jdm64 suggested "less than one year old" and I agreed - nobody else took part in the discussion. We removed OneStat on Dec 9, 2009. For a long time afterwards the article mentioned 12 months, which I recently reinstated.
As far as the time limit goes, I don't think it matters about AT being 'out of date' because (a) it's an encyclopedia, not a news site, so up-to-the-minute topicality is not essential, and (b) our choice of median rather than mean does a good job of excluding outlier figures. Harumphy (talk) 15:04, 21 January 2011 (UTC)[reply]
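
To illustrate point (b) above, here is a minimal sketch comparing how the mean and the median react when one source is badly out of date. All the share values are hypothetical and are not taken from the article's table; the last one stands in for a source that has not been updated for many months.

```python
from statistics import mean, median

# Hypothetical shares (percent) for one OS column, one value per source.
# The final value plays the role of a stale, outlying source.
shares = [22.0, 23.5, 24.0, 25.0, 14.0]
fresh = shares[:-1]  # the same column without the stale source

mean_shift = abs(mean(fresh) - mean(shares))
median_shift = abs(median(fresh) - median(shares))

print("mean   with/without stale source:", round(mean(shares), 3), round(mean(fresh), 3))
print("median with/without stale source:", median(shares), median(fresh))
print("shift in mean:", round(mean_shift, 3), "shift in median:", round(median_shift, 3))
```

With these made-up numbers the stale value moves the median much less than it moves the mean, which is the property being relied on here.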

Median Windows Numbers

I've been playing with the Median numbers, which I had intended to quote in an article I was writing. I'm not quoting them. The numbers do NOT add up. No matter what I've tried, I cannot get those numbers to make any sense, and since there is no explanation of the calculations used to determine the Median, the only conclusion I can draw is that the numbers were either invented, or are in error. So instead of reporting your numbers, I'm reporting them, and my conclusion that this article is in error. If you want to look at the article, which is my prediction for where OS usage shares will be in 2012, it will be at: http://madhatter.ca

One other point - Netbooks should be included as Notebooks, and Tablets should be in a separate category, due to the form factor. Tablets have more in common with phones than they do with notebooks.

UrbanTerrorist (talk) 20:01, 8 January 2011 (UTC)[reply]

The medians are calculated just like any other median, surely? In each column, the median is the middle value of the group. Thus the median of 1, 2 and 999 is 2. Where there is an even number of values, it's the mean of the middle two, so the median of 1, 2, 3 and 999 would be 2.5. Are you saying that our table doesn't do this? If not, what precisely do you think does not add up? Harumphy (talk) 20:08, 8 January 2011 (UTC)[reply]
Are they calculated like any other median? Who knows? There is no explanation as to the method(s) used, and no reference to an explanation. In effect we are given numbers, and told to believe them, which is against the policies of Wikipedia. Either provide an explanation, or remove the median figures.
UrbanTerrorist (talk) 14:49, 11 January 2011 (UTC)[reply]
The Median label links to the Median article, which explains how medians are calculated. Do you consider this not enough? Do you propose adding a footnote with more details on it? Wikiolap (talk) 18:51, 11 January 2011 (UTC)[reply]
The term 'median' has a precise meaning and there's only one way of calculating it. Anyone who wants to can calculate the median herself and get the same figure. The problem here is not the article, but your inability to click on the link to the median page to find out what it means. Harumphy (talk) 13:58, 17 January 2011 (UTC)[reply]
A link to the explanation would make things clearer. UrbanTerrorist (talk) 07:51, 27 January 2011 (UTC)[reply]
There is also the very common problem of confusing median with mean, or of believing they share some properties that they do not, like adding up to 100% (under certain circumstances). The different properties are not necessarily easy to understand, and the article is quite heavy for somebody not used to mathematical theory. --LPfi (talk) 10:49, 18 January 2011 (UTC)[reply]
Agreed. Which is why an explanation, or a link to an explanation is needed. UrbanTerrorist (talk) 07:51, 27 January 2011 (UTC)[reply]
I should mention that I'm using the numbers for some articles that I'm writing, and I want to make sure that they are as accurate as possible, thus the questions. And yes, I link back to the source.
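
For anyone wanting to check the calculation, the following sketch reproduces the two examples given above and then shows why per-column medians need not add up to 100% even when every source's own row does. The three-source table is hypothetical and is not the article's data.

```python
from statistics import median

# The two examples from the discussion above.
odd_example = median([1, 2, 999])       # middle value
even_example = median([1, 2, 3, 999])   # mean of the two middle values
print(odd_example, even_example)

# Hypothetical table: one row per source, each row summing to 100%.
sources = [
    {"Windows": 90.0, "Mac": 6.0, "Linux": 4.0},
    {"Windows": 88.0, "Mac": 10.0, "Linux": 2.0},
    {"Windows": 92.0, "Mac": 5.0, "Linux": 3.0},
]

# Median taken column by column, as in a summary row of such a table.
column_medians = {
    os_name: median(row[os_name] for row in sources)
    for os_name in sources[0]
}
print(column_medians, "sum of medians:", sum(column_medians.values()))
```

Each median is picked independently from its own column, possibly from a different source each time, so nothing forces the medians to sum to the same total as any single row.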

What gives with server market share BY REVENUE.

Revenue doesn't measure server market share; it measures how much money the supplier of one type of server rakes in. It also measures how much purchasers have had to pay for servers of that type ... both of these are the same number. Because some types of server cost far more than others, the metric skews the perception of market share towards whichever servers cost the most. These are perceived as having a greater market share, even though they may be relatively few in number.

Furthermore, there are far more purchasers of servers than there are suppliers. Ergo, from the overwhelmingly predominant perspective, it is better to label this number as COST rather than revenue. Far more people would see it as a cost as opposed to those who see it as revenue.

So, I will keep trying to change the word "Revenue" within the server market share section to read "Cost" instead, until it sticks, because this wording gives the proper perspective on it from the vast majority viewpoint.

Alternatively, one could remove the "Revenue" metrics entirely, because they simply do not show server market share as they purport to.—Preceding unsigned comment added by 118.210.63.179 (talk) 10:46, 9 January 2011

Please sign your comments - otherwise we won't know who said what. Please see Wikipedia:Signatures.
Before you 'keep trying to change the word "Revenue" etc.' please read Wikipedia:Edit_warring.
As you've said, revenue and cost refer to the same number. It's the same sum of money seen from the two sides of the deal. The sources we're citing measure sales, not purchases. So revenue is the more accurate term in this context. Harumphy (talk) 13:59, 9 January 2011 (UTC)[reply]
Why include market share by revenue figures at all? They're useful for stock investors (i.e. which OS is generating the most revenue from a given market), but they will deceive casual readers expecting to learn about the "Usage share of operating systems on servers". Wallers (talk) 14:16, 9 January 2011 (UTC)[reply]
IDC and Gartner are well-known sources in the server industry. They report in revenue probably because business people are more interested in the money than in the number of units -- it's what they're used to. Also, it's easier to measure, because you can't just check what OS a server is running like you can with desktop web browsers. A server could be hosting several virtual servers (up to 15), and each virtual server can have its own IP address. So without detailed investigation, you'd see 15 servers when in reality there's only one real server. Hence, the real number of servers (as opposed to the number of installs of an OS) is more closely correlated to revenue. It gets complicated, though, because the licensing of Linux servers can be free if the company goes with a distro without 24/7 support (like Debian or CentOS), or it could be costly if paying for a full subscription of RHEL. So, even though Linux is low on revenue, it's still really high in actual usage. Jdm64 (talk) 18:20, 9 January 2011 (UTC)[reply]
Jdm64 wrote it exactly right. There are different metrics to measure market share, and they indeed measure different things. Market share by units is interesting for knowing which server OS is most popular. Market share by revenue is interesting for seeing which server OS vendor is making the most money. So there is no contradiction - both are interesting, both are useful, just for different purposes. Wikiolap (talk) 18:11, 10 January 2011 (UTC)[reply]
"Making the most money" also means "getting the most money out of people for less cost to yourself". "Revenue" is a word with positive connotations in people's mind, whereas "Cost" has negative connotations. "Revenue" and "cost" are the same number ... what is revenue to sellers of servers is cost to buyers of servers. Since there are far more buyers than sellers, it is in the best interests of more people to show market share from the perspective of buyers rather than sellers (that is, to show it as cost rather than revenue). A casual reader might see the OS with the highest revenue and without thinking about it associate that positive term with "the best choice", when in fact it is the most costlly to him/her. From this perspective, citing statistics for sale value(price) of servers, labelling it as positive-sounding term revenue, and claiming that this shows "market share" is doing a sever dis-service to most people. In fact, it comes perilously close to free advertising for one company, which I would have thought goes against Wikipedia policy.—Preceding unsigned comment added by 118.210.63.179 (talk) 14:08, 11 January 2011
I think you are stretching the definition of the advertising a bit :) Your thinking about positive vs. negative association is interesting, but it is your opinion only. In encyclopedia we cite verifiable and reliable sources - and both Gartner and IDC qualify as such. They label the metric Revenue, and we must respect that.Wikiolap (talk) 18:54, 11 January 2011 (UTC)[reply]

IDC also report units, so there's no reason to use revenue, and I've changed the numbers to the unit rather than revenue figures. For the Gartner figures, I checked the source, and they appear to be unit figures as well, not revenue figures.Shalineth (talk) 06:36, 10 February 2011 (UTC)[reply]

Can you show where in the source the reported percentages are referred to as units? In the table that we cite, the column headers say Revenue. Wikiolap (talk) 20:53, 10 February 2011 (UTC)[reply]
The Gartner source is a three-year-old Reuters article. The text in the article reads:
According to research firm Gartner, the Windows share of global server shipments gained a percentage point to 66.8 percent in 2007 from a year earlier. Open-source Linux's share fell by a percentage point to 23.2 percent last year and Unix dropped to 6.8 percent in 2007 from 8.1 percent in 2006.
Note that this refers to the share of global server shipments, i.e. units, not to the share of global server revenue. The figures are also similar to IDC unit figures from about the same time, but very different from IDC revenue figures. It was only in 2005/6 or so that Windows servers overtook Unix servers in terms of revenue, but Windows has been ahead of Unix in unit shipments since the 1990s.
The IDC figures in the table are indeed revenue for server hardware (not revenue for operating systems), but it is a methodological error to use this as an indicator of server OS 'usage share'. What possible sense is there in saying that one server costing €20 000 and running one instance of AIX contributes 40 times as much to the AIX 'usage share' as one server costing €500 and running one instance of Linux or Windows?
I had provided a source for IDC unit shipments and corrected the table to include them, but this was reverted.
In a comparison of server revenue or server profitability, prices matter. If HP, for example, are selling a lot of €50 000 servers and Dell are selling a lot of €1 000 servers, that has a huge impact on their respective results. If each server runs only one copy of an OS, however, the 'usage share' is 1 for each server. The idea that multiplying server operating system units by the cost of the hardware the OS runs on somehow represents 'usage share' is completely nonsensical. Shalineth (talk) 16:28, 12 February 2011 (UTC)[reply]
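
The arithmetic behind the €20 000 versus €500 example can be sketched as follows. The two-server "market" is purely hypothetical and only reuses the prices from the comment above; it is not derived from IDC or Gartner data.

```python
# Hypothetical shipments: one OS per row, prices echoing the example above.
shipments = [
    {"os": "AIX", "units": 1, "unit_price_eur": 20_000},
    {"os": "Linux", "units": 1, "unit_price_eur": 500},
]

def shares(rows, weight):
    """Percentage share per OS, using `weight(row)` as the quantity counted."""
    total = sum(weight(r) for r in rows)
    return {r["os"]: 100.0 * weight(r) / total for r in rows}

# Share by units: every server counts once.
unit_share = shares(shipments, lambda r: r["units"])
# Share by hardware revenue: every server counts in proportion to its price.
revenue_share = shares(shipments, lambda r: r["units"] * r["unit_price_eur"])

print("by units:  ", unit_share)
print("by revenue:", revenue_share)
```

By units the two operating systems are level; by hardware revenue the single expensive server outweighs the cheap one by a factor of 40, which is the distortion being objected to.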

Overview section

Recently somebody added the overview table, and various editors including me attempted to knock it into shape. I don't think we've been very successful. I can't see where many of the figures come from. The web client medians don't (and shouldn't be expected to) add up to 100% and so don't constitute 'share' anyway. Should we delete this section, or can it be improved? Harumphy (talk) 08:55, 17 January 2011 (UTC)[reply]

I'd lean towards deleting it. Many of the fields are blank because some of the selected OSs come from disjoint usage (i.e. mainframe and smartphones). The one thing going for the table, though, is the quick summary. I'd propose that key stats from the table be included in the opening paragraph: something to elaborate on the current opening, but with some actual numbers. Jdm64 (talk) 10:32, 17 January 2011 (UTC)[reply]
Does anyone want to speak in the overview section's defence? If not I'll delete it in a couple of days from now. As far as putting figures in the opening paragraph goes, ideally that would be done in a way that doesn't need updating every month. Harumphy (talk) 09:12, 18 January 2011 (UTC)[reply]
I vote to delete it, the idea of this overview table never appealed to me, and it doesn't look like it is helping the article. Wikiolap (talk) 17:26, 18 January 2011 (UTC)[reply]
Now deleted as agreed. Harumphy (talk) 10:14, 21 January 2011 (UTC)[reply]

Tablets

Tablets need to be moved to their own section. They have nothing in common with Netbooks, though they may replace them in some sales. Netbooks should be combined with Notebooks, the difference between them is artificial.

UrbanTerrorist (talk) 07:55, 27 January 2011 (UTC)[reply]

Yes. A netbook is just a small laptop/notebook. I've changed the sub-heading "Netbooks and Tablets" to just "Netbooks". There's very little info on tablets as a category so far. The main one is the iPad, which runs iOS, and this gets covered anyway. Do all tablets speak mobile, or are some of them WiFi only? --Harumphy (talk) 09:20, 27 January 2011 (UTC)[reply]
Sorry Harumphy, I've been busy. Please see the note at the bottom under Long Term Suggestions. UrbanTerrorist (talk) 03:35, 11 August 2011 (UTC)[reply]

Font sizes

I've reverted 1exec1's changes to a couple of tables, in which the font size was fixed at 85% of normal.

Please, if font sizes look too big on your computer, it doesn't mean they look too big on everyone else's. The web is not a wysiwyg medium. You can adjust your browser's normal font size to suit your preferences. I have adjusted mine, and I don't want you reducing the font size on my computer (and everyone else's) just because it looks better on yours!

Besides, from a graphic design point of view it looked awful. --Harumphy (talk) 09:37, 9 February 2011 (UTC)[reply]

Possibly dubious server share claims based on websites

Are there any authoritative sources suggesting that scanning public websites is a reasonable way to estimate server market share? It seems rather dubious to me. For one thing, Linux is well known as a good OS for running web servers (the LAMP stack), so a sample of web servers may not be representative of servers in general, but rather biased towards Linux.

Another problem is that a single server OS can host a large number of small websites, whereas a large website may require several servers, especially if it makes heavy use of SSL. If website characteristics differ across sites, then an estimate of server market share based on the number of websites would be biased towards the system most favoured by smaller, less complex sites. Netcraft report a 50 per cent Windows share for SSL websites (http://news.netcraft.com/ssl-survey/), compared with a 25 per cent share for non-SSL websites, which suggests that estimates based on websites may indeed be biased towards Linux.

Unless an authoritative source suggests that counting the number of public websites using a particular server OS is a valid way of estimating server OS market share, this looks like original research, and I suggest it be deleted. A separate section on web server OS share might be reasonable. Shalineth (talk) 06:55, 10 February 2011 (UTC)[reply]

Netcraft is a reliable and verifiable source, and we properly reference it. They choose to analyze the OS share of web servers, and we accurately mention it. Hence this is not original research. Some readers may or may not agree with their methodology, but it is really not up to us to make judgments on whether or not we like it. As an encyclopedia, we report by citing reliable and verifiable sources. Wikiolap (talk) 20:51, 10 February 2011 (UTC)[reply]
Netcraft present website statistics, not server OS market share. You're confusing two different things, and that's where the original research lies. It's like looking at valid statistics for flights out of a particular airport and claiming that represents airliner market share.Shalineth (talk) 11:13, 12 February 2011 (UTC)[reply]
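
The concern that website counts and server counts can diverge is easy to show with a toy calculation. Everything below is hypothetical (the operating systems are deliberately unnamed, and the site and server counts are invented); it only sketches the mechanism and says nothing about the real direction or size of any bias.

```python
# Hypothetical deployments: (operating system, number of sites, number of servers).
deployments = [
    ("OS A", 300, 1),   # one small server hosting many simple sites
    ("OS B", 1, 30),    # one large site spread across many servers
]

def share(index):
    """Percentage share per OS for the tuple field at `index` (1 = sites, 2 = servers)."""
    total = sum(d[index] for d in deployments)
    return {d[0]: round(100.0 * d[index] / total, 1) for d in deployments}

site_share = share(1)
server_share = share(2)
print("share of sites:  ", site_share)
print("share of servers:", server_share)
```

In this made-up case OS A dominates when sites are counted and OS B dominates when servers are counted, so a site survey alone cannot settle the server share.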

Marketshare Servers based on websites is totally misleading

The current server share language is complete nonsense. It somehow asserts that web servers are a good indicator of server share. That is utter nonsense. The vast majority of servers are not web servers.

There have been edits made to temper the uncited claims in this section, but they have been reverted.

It's clear that this section is merely a +POV apology for certain products. It's clear that wikipedia is being used as a promotional tool for certain products. —Preceding unsigned comment added by 173.206.8.177 (talk) 18:59, 11 February 2011 (UTC)[reply]

There are two issues here:
1. The explanations in the Server section are indeed unreferenced, and this invites people to add even more unreferenced material. I am OK with completely removing the text, but past experience shows that someone will add it again anyway. The better approach is to find reliable and verifiable references for that portion (something I wasn't able to easily find myself).
2. Methodology of measuring market share. We as an encyclopedia should not pass judgement on whether some methodologies are "complete and utter nonsense" or not. Some people claim that measuring revenue is nonsense, some claim that measuring web servers is nonsense. Everybody is free to form their own opinions - but we as an encyclopedia just report what our sources say - Gartner, IDC, Netcraft etc.
Wikiolap (talk) 20:12, 11 February 2011 (UTC)[reply]
Re: #2; "We as encyclopedia should not pass judgement on whether some methodologies are "complete and utter nonsense" or not." I agree. But in this context, presenting only publicly available webservers -- in a conversation about Server OS Marketshare **is**.
The discussion re: web server marketshare is irrelevant here, in this context. All the uncited material from the above paragraph and the data in the "method units (web)" table should be removed. It is intentionally misleading and utter nonsense to compare oranges to bananas in this way.
173.206.8.177 (talk) 20:29, 11 February 2011 (UTC)[reply]
Could you tell why exactly you want to remove web units data?1exec1 (talk) 22:56, 11 February 2011 (UTC)[reply]
We have had, at various times, three different methods of counting servers in this section: unit sales/revenue/web servers. All three have strengths and weaknesses - there is no clear-cut right and wrong here. We should just report all three, perhaps in three separate tables, pointing out the strengths and weaknesses of the methods too.--Harumphy (talk) 00:00, 12 February 2011 (UTC)[reply]
Reporting all three in separate tables sounds like a good first step. Based on the source, however, the Gartner figures are units, not revenue, so that's wrong to start with (I corrected the mistake, but someone reverted it with no explanation). Reporting server market share in both units and revenue would be the best idea.
Website share should be split out into a separate category, since it's a completely different issue from server OS market share. The text is also horrible, and should probably be deleted. I made some minor improvements to make it less POV, but those were reverted too.
If you oppose splitting the website figures into a separate category, can you point to an authoritative source that claims Netcraft's website survey has anything at all to do with server market share?
Overall, it's obvious someone is abusing the article to promote a particular POV. I'm not really interested enough to bother with it, but maybe someone with more time on their hands can correct this. If not, I suppose it'll be another case where the reputation of Wikipedia is damaged by a zealot pushing a particular POV and reverting any corrections or attempts to make the text NPOV. Shalineth (talk) 11:32, 12 February 2011 (UTC)[reply]
Web servers are a subset of servers-in-general, so I think the best thing to do would be to do units and revenue in two tables, then add a sub-heading "Web servers" with the third table in that new sub-section.--Harumphy (talk) 13:45, 12 February 2011 (UTC)[reply]
Yes, certainly, but web sites are not the same as web servers. I imagine it takes an enormous number of servers to run www.facebook.com, for example, whereas even a small server could run hundreds of very simple sites. The fact that web servers are a subset of total servers is a minor problem. The bigger problem is that there is no one to one correspondence between web sites and web servers, much less between web sites and either servers generally or server OS installations. This means that, barring authoritative evidence to the contrary, web site numbers cannot be considered valid estimators of even web server OS market share, much less overall server OS market share. Shalineth (talk) 14:50, 12 February 2011 (UTC)[reply]
3. The definition of "server" is broad. Web server market share might be estimated via the web while giving numbers for File server market share (down to NAS) via web is a challenge. --95.117.233.197 (talk) 13:59, 12 February 2011 (UTC)[reply]
4. The conjecture that IDC or Gartner figures substantially underestimate Linux or open source servers is logically unsound. As documented here, IDC unit figures for server shipments include Windows, Linux, Unix and other. Servers sold with no operating system would thus fall into the other category. However such servers make up only about 0.3% of the total (for Q1 2010). This implies two things:
1. The Windows and Unix market shares, 75.3% and 3.6% respectively in Q1 2010, are minimum market share levels, and do not overstate market shares for shipped servers.
2. The Linux market share is not substantially understated. Even if Linux is installed on every single server that didn't ship with either Windows or Unix, its Q1 2010 market share would only increase from 20.8% to 21.1%.
In light of the above, I suggest that the unsupported conjecture that IDC numbers understate open source server operating systems be deleted from the article, unless authoritative evidence to the contrary is provided. Shalineth (talk) 14:50, 12 February 2011 (UTC)[reply]
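
The bound in point 4 can be checked with a few lines of arithmetic. The percentages are the Q1 2010 IDC unit shares as quoted in the comment above; the dictionary and variable names are just illustrative.

```python
# Q1 2010 unit shares (percent) as quoted in the comment above.
idc_q1_2010 = {"Windows": 75.3, "Unix": 3.6, "Linux": 20.8, "Other": 0.3}

# The four categories should account for the whole market.
total = sum(idc_q1_2010.values())

# Worst case for the "understated Linux" conjecture: every server in the
# "Other" category (including those shipped without an OS) ends up on Linux.
linux_upper_bound = idc_q1_2010["Linux"] + idc_q1_2010["Other"]

print("total:", round(total, 1))
print("Linux share, upper bound:", round(linux_upper_bound, 1))
```

Note that this bound only concerns the operating system on servers as shipped; it does not model servers being reinstalled with a different OS later.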

Suggestions for correcting server market share section

I propose the following corrections to the server market share section:

  1. Remove unsupported text claiming that IDC/Gartner figures understate open source OS share.
  2. Remove irrelevant web site share figures for possible inclusion in a separate section on website OS shares.
  3. Correct labelling of Gartner unit figures, which are currently mislabelled as revenue figures.
  4. Replace methodologically incorrect IDC server hardware revenue figures with methodologically correct server unit figures.

I probably shan't have time to check the page before next weekend. Comments appreciated. Shalineth (talk) 16:35, 12 February 2011 (UTC)[reply]

1 - I support removing all unreferenced claims.
2 - Measuring market share of web sites is a valid method that at least 3 different sources use (Netcraft, securityspace, w3tech) - we should not remove legitimate, reliable and verifiable sources. We already have some text which tries to clarify the difference between the methodologies. Maybe this text could be improved, but it should not contain unreferenced claims either (see #1).
3 - Gartner reports revenue. The source is reliable and verifiable, but not public - the report itself costs money. I had access to it a couple of years ago; I will try to get access again and verify that it is indeed revenue.
4 - IDC reported market share by revenue, and that is a perfectly valid methodology (IDC is a reliable and verifiable source). I used to have an additional line in the table for IDC numbers by unit, but it was removed by other editors. I will be happy to add it back.
Wikiolap (talk) 00:46, 13 February 2011 (UTC)[reply]
Remove unreferenced claims. Report web site share in a new section, separate from the server section. Report both units and revenue in separate tables with correct labelling, even if there's only one cited source.--Harumphy (talk) 11:08, 13 February 2011 (UTC)[reply]
If we separate website and server-share reports, we had better not include the website share at all. I suggest reordering the current table so that sources reporting website share are grouped together. We can also introduce one more column that says which method was used to acquire the statistics. Also see my answer below. 1exec1 (talk) 15:20, 13 February 2011 (UTC)[reply]
I disagree. The article already has a separate section for 'Web clients', which is distinct from the sections for 'Desktop and laptop computers', 'Netbooks' and 'Mobile devices'. The consistent approach for servers would be to have a section for 'Web sites', which is distinct from 'Servers'. Shalineth (talk) 21:09, 21 February 2011 (UTC)[reply]
2. Measuring market share of web sites is a valid way of measuring web site share. This is an article about server OS usage. Is there an authoritative source claiming that measuring web site share is a valid way of measuring either web server share or web server OS share? If not, I suggest it belongs in its own section (or perhaps own article) -- an article about web site market share, as opposed to (web) server OS market share. Again, I must stress, these are not synonymous. It is a severe methodological error to assume they are. Shalineth (talk) 12:19, 13 February 2011 (UTC)[reply]
3,4. Gartner and IDC report both revenue and units, although not all reports contain both measures. Revenue is a valid measure for market share, which can be defined in terms of either revenue or units. This article is about usage share, which implies units. Second, the revenue figure is for servers, not server OSes. That would be fine in an article about server hardware market share, but this is an article about server OS usage share. Again, the figures are absolutely valid, but they're being used incorrectly in this article. Shalineth (talk) 12:19, 13 February 2011 (UTC)[reply]
@Shalineth: Website share is a proxy for the actual server OS usage share in the same way as inspecting user agent strings is a proxy for desktop OS market share. If you consider them inappropriate, then sources reporting server market/unit share are not appropriate reference points either, as they report current sales, not the share of already deployed servers.
In conclusion, all sources used in the article are biased in one way or another. Since we are only presenting and commenting on the data, not interpreting it, all sources must have the same credibility, unless there is a strong reason not to do so. 1exec1 (talk) 15:20, 13 February 2011 (UTC)[reply]
This is true. Sales by hardware units and sales by hardware revenue are also proxies for OS usage share. None of the three methods correlates directly with OS usage share, but all three are of interest nevertheless. We should just report what the sources say, accompanied by a concise summary of the strengths and weaknesses of each method. It is for the reader to decide how much credence to give to each method, not us. --Harumphy (talk) 09:57, 14 February 2011 (UTC)[reply]
@ 1exec1
It isn't quite the same thing, since there's usually a 1:1 mapping of web clients to client OSes. For web servers, a single server can run a huge number of websites, and at the other extreme, some websites require large server farms. All this means that the approximation is much closer on the client side. In any case, I think it's perfectly reasonable to include web site OS share, as long as it's properly labelled as 'Web site OS usage' and not conflated by original research with 'server OS usage'.
The same applies to 'server OS unit shipments' and 'server hardware revenue'. It's fine to include them both, as long as it's made very clear what they are, and 'server hardware revenue' isn't mislabelled as 'server OS revenue' or 'server OS usage'. What actually brought this article to my attention in the first place was confused comments by Linux advocates who thought 'revenue' in this article referred to software vendor revenue, not to server hardware revenue, and were going on about how most users don't pay for Linux so revenue figures are invalid, etc. The section on server OSes is very unclear about these things, and looks like a clear case of misrepresentation of data (not necessarily intentional -- though the unreferenced comments suggest it is). The data are valid, but are being misused. Shalineth (talk) 21:09, 21 February 2011 (UTC)[reply]
This sounds like a consensus to me - we keep the valid data in the article, but relabel it to disambiguate what it actually means. I would support this effort.Wikiolap (talk) 23:51, 21 February 2011 (UTC)[reply]
It sounds like consensus to me too.--Harumphy (talk) 13:34, 27 February 2011 (UTC)[reply]
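The distinction this thread settled on, that web site share and server share measure different things, can be illustrated with a toy calculation. All figures below are invented for illustration ("OS A" and "OS B" are placeholders, not real operating systems or measured data); the sketch only shows how far the two ratios can diverge when one server hosts many sites and one site needs many servers.

```python
# Hypothetical deployments: (operating system, number of servers, number of web sites hosted).
deployments = [
    ("OS A", 1, 300),   # one small shared host serving 300 simple sites
    ("OS B", 200, 1),   # one large site spread over a 200-server farm
]


def share_by(deployments, index):
    """Percentage share per OS, counting column `index` (1 = servers, 2 = sites)."""
    totals = {}
    for row in deployments:
        totals[row[0]] = totals.get(row[0], 0) + row[index]
    grand_total = sum(totals.values())
    return {name: 100.0 * count / grand_total for name, count in totals.items()}


server_share = share_by(deployments, 1)
site_share = share_by(deployments, 2)

for name in server_share:
    print(f"{name}: {site_share[name]:.1f}% of sites, {server_share[name]:.1f}% of servers")
```

In this made-up case the OS with almost all of the web sites runs on well under 1% of the servers, which is why the two measures need separate labels.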

Time limit for out-of-date sources

There was some discussion earlier in Talk:Usage_share_of_operating_systems#Should_we_remove_AT_Internet_Institute_from_web_client_stats.3F. I think it's fair to say there's a consensus that we should apply the same time limit, whatever that limit is, to all the sources. At the moment it's 12 months. Someone suggested we should reduce it to 6 months. (If we did that then ATII would get removed on 1st April if they haven't updated by then, because they last updated on 31/9/2010.) So, should we cut the time limit to 6 months?--Harumphy (talk) 13:34, 27 February 2011 (UTC)[reply]

I think yes. The previous discussion was stopped by the fact that Wikipedia doesn't update either. As the problem has since been solved, I know of no reason to keep a single old source that skews the data. 1exec1 (talk) 17:11, 27 February 2011 (UTC)[reply]
6 months seems reasonable to me. Jdm64 (talk) 02:21, 28 February 2011 (UTC)[reply]
FYI ATII has just updated. They must have heard us!--Harumphy (talk) 16:02, 1 March 2011 (UTC)[reply]
Yes, and I'm extremely disappointed with them. As you can see in the more detailed PDF, they treat Android as the "Google Operating System", as if it were not Linux, providing inaccurate data for this table... 89.181.106.123 (talk) 00:29, 2 March 2011 (UTC)[reply]
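The time-limit rule discussed in this thread can be sketched as a small date check. This is only an illustration: the function names are invented, the rule "out of date once N whole months have passed" is one possible reading of the proposal, and the dates in the example are placeholders rather than any source's actual update dates.

```python
from datetime import date


def whole_months_between(earlier, later):
    """Number of complete calendar months from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the current month is not yet complete
    return months


def is_out_of_date(last_update, today, limit_months):
    """True if the source has gone `limit_months` or more without an update."""
    return whole_months_between(last_update, today) >= limit_months


# Placeholder dates, chosen only to show the effect of changing the limit.
last_update = date(2010, 9, 30)
check_day = date(2011, 4, 1)

print(is_out_of_date(last_update, check_day, limit_months=12))
print(is_out_of_date(last_update, check_day, limit_months=6))
```

With the same pair of dates, a source that survives the 12-month limit is dropped under the 6-month limit, which is the practical effect of the change being voted on.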

Mobile Devices Citation

Caption on image currently reads "Share of 2010 Q4 smartphone sales to end users by operating system, according to Gartner", followed by a citation.
The numbers in the pie chart are not contained within the cited article. The cited article was written on 19 May 2010, and reports on 2010 Q1 numbers.
Caption should be revised to cite an article containing the numbers used on the pie chart, or the pie chart should be changed to reflect the numbers in the cited article. Mismatches are bad, mmmkay?
64.113.8.130 (talk) 22:55, 4 April 2011 (UTC)[reply]

Web clients - remove sources

Both AT Internet and StatOwl ignore mobile clients in their reports (well, in fact AT Internet notices the existence of iOS but doesn't consider Android worth counting, nor as a Linux "variant", StatOwl just ignores them). That makes the rest of the values inflated, so comparing the numbers from these two sources with the rest isn't a fair comparison. Thus, I propose for us to just stop taking into account both these sources, until they start reporting (or taking into account in their reports) the existence of mobile web clients. 195.23.131.230 (talk) 15:58, 12 April 2011 (UTC)[reply]

AFAICS AT Internet includes Android and a number of things under 'other'. This is perfectly OK for our purposes. StatOwl is more of a problem because they just take desktop OSes with above 0.1% share and expand the numbers so they add up to 100%. This is inconsistent with the rest of our table and there's no easy way of fixing it. So I think we should keep AT but I've no objection to removing StatOwl if that's where the consensus is.--Harumphy (talk) 07:17, 13 April 2011 (UTC)[reply]
AT Internet: The fact that AT Internet puts things under "other" that we don't makes our "other" figures for AT Internet erroneous. The only ways to be correct about the data we're dealing with are either removing AT Internet as a source, or putting things like Android under "other" too, as they do. So we actually have three choices: 1) being wrong (as we are now), 2) removing one source (and thus reducing the accuracy of the data we're presenting), or 3) putting Android under "other", which I honestly don't like, since Android is technically Linux, so the "Linux" numbers would be "some Linux", which would cause confusion... 195.23.92.1 (talk) 16:07, 8 August 2011 (UTC)[reply]
StatOwl - I vote on removing StatOwl, since the fact that they don't have an "other" makes their data meaningful only in comparison between those OSs they have stats on. It might be interesting data, but it simply doesn't fit on what we're trying to represent in this table. 195.23.92.1 (talk) 16:07, 8 August 2011 (UTC)[reply]
I oppose removing StatOwl - they are a valid, reliable and verifiable source. We could add a note explaining their methodology if more explanation is needed, but we should not remove this source. Wikiolap (talk) 17:34, 13 April 2011 (UTC)[reply]
They are reliable and verifiable, yes, but they're not measuring the same thing we're representing in that table. They represent the share between a list of OSes, while we're representing the share between all OSes (thus the "other" column). They don't give us enough data (an "other" column, for instance) to even find out the real percentage of the OSes they're representing, so their numbers, while interesting, simply don't have enough info to fit in our table. Putting them there, as they are nowadays, just adds known-yet-unmeasurable error to the table... 195.23.92.1 (talk) 16:07, 8 August 2011 (UTC)[reply]
I agree with your comment higher up this section about the ambiguity of our 'other' column. It isn't immediately obvious that what we count under 'other' varies from source to source. Rather than eliminate a source because it doesn't fit our idea of 'other', it would be better to eliminate the 'other' column from the table. There's nothing wrong with AT as a source. StatOwl, on the other hand, is more problematic because it only covers desktop OSes with >0.1% share and then expands them to fill 100%. So I think we should keep AT, dump StatOwl and dump the 'other' column. --Harumphy (talk) 22:10, 9 August 2011 (UTC)[reply]
I concur with the idea of removing StatOwl. However, I don't think that dropping the 'other' column is a good idea unless all the rows add up to 100%. Since that column is defined as 'whatever doesn't fit in the current columns', or simply '100% - sum of the columns', it will be implied even if we dump it. So I don't see the point in doing that. The above-mentioned issue of AT not using the same 'other' definition as ours can be solved by merging the problematic cells into one for now. 1exec1 (talk) 00:18, 11 August 2011 (UTC)[reply]
Sorry, I don't get that last bit. What do you mean by the "problematic cell" and what are you suggesting we merge it into? --Harumphy (talk) 10:28, 12 August 2011 (UTC)[reply]
I meant doing something like this:
Source           Date       Microsoft Windows                     Apple           Linux kernel based  Symbian / BlackBerry OS / Other
                            7      Vista  XP     All versions     Mac OS X  iOS   GNU/Linux  Android  (merged cell)
AT Internet [4]  Apr. 2011  28.8%  16.4%  42.1%  88.4%            6.9%      2.8%  0.9%       0.5%     0.5%

1exec1 (talk) 09:30, 14 August 2011 (UTC)[reply]

Doing so is fine by me, as long as we do the same for the median... 195.23.92.1 (talk) 19:28, 17 August 2011 (UTC)[reply]
It seems a bit messy to me, especially if we do the same for the median. Overall I don't think it's an improvement.--Harumphy (talk) 10:01, 18 August 2011 (UTC)[reply]
Since dumping the "Other" column without adjusting the percentages would be odd (lines wouldn't add up to 100%) and this solution is messy, would you accept a solution where the "Other" column would be simplified (and we would add the Symbian and Blackberry values to the "other" column)? Feel free to add your comment and also your vote in the "Vote Count" section for this alternative 195.23.92.1 (talk) 14:21, 18 August 2011 (UTC)[reply]
If we're keeping the 'Other' column, I'm in favour of leaving it as it is. The fact that for one source it includes Symbian and Blackberry is a very minor irritation which I can live with more easily than any of the proposed remedies.--Harumphy (talk) 15:23, 18 August 2011 (UTC)[reply]
Shouldn't we at least put some kind of notice in the footnotes, then? Keeping it "as it is" results in wrong data both for "Other" (more than it should be) and for "Symbian" and "Blackberry" (less than it should be). Another thing we could do (messy, but more agreeable to me) would be to do the same as is done for Clicky and proposed for StatOwl, and "calculate" which percentage of that "Other" is "Symbian" and "Blackberry", taking the other sources as a reference. 195.23.92.1 (talk) 14:43, 19 August 2011 (UTC)[reply]
How about this: we replace '---' with 'n/a', change heading 'Other' to 'Other inc. n/a' and add a footnote saying n/a = data not available from source --Harumphy (talk) 21:37, 19 August 2011 (UTC)[reply]

StatOwl - further thoughts

Looking at the above discussion, I can see that I've blown hot and cold on StatOwl over the months. The only real problem I have with StatOwl is that it ignores mobiles and inflates desktop share to 100%. We have a related problem with Clicky Web Analytics, which produces separate stats for desktops and mobiles, which we multiply by roughly 0.94 and 0.06 respectively to get the figures for our combined table. (I work out the exact figure each month from the mean of two sources as explained in the footnote.) If we used the same correction for StatOwl - i.e. multiply its figures by the desktop factor of 0.94-ish, that would eliminate the inconsistency. Naturally this would have to be explained in the footnote. I've suggested this in the past, but not got support for it, so for many months the status quo has been that applying this kind of correction based on figures from two other sources is somehow OK for Clicky but not OK for StatOwl. It seems to me that we should be consistent here, so please comment here and see the further vote option below.--Harumphy (talk) 08:10, 19 August 2011 (UTC)[reply]

If you do so, you won't have data to fill in the blanks: if you put the rest of the percentage to reach 100% in "Other", then that data is wrong (because you have Android, iOS, Blackberry and Symbian's percentages there); if you also don't fill "Other", then the table will be... strange looking, even if more accurate. From these two choices I still prefer the third (ditch StatOwl), but if that's not what will happen... which of the two solutions up there do you propose? 195.23.92.1 (talk) 14:35, 19 August 2011 (UTC)[reply]
I think the other 6% or so should go in the 'other' column. The footnote explains how the 'other' column is calculated so I don't have a problem with it.--Harumphy (talk) 21:24, 19 August 2011 (UTC)[reply]
Three in favour, none against, so I've implemented this. --Harumphy (talk) 14:40, 24 August 2011 (UTC)[reply]
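For anyone wanting to reproduce the correction agreed here, a minimal sketch follows. It assumes the desktop factor is one minus the mean mobile share reported by two reference sources, and that the remainder goes into 'Other' as discussed above. The function name and every number in the example are hypothetical, not StatOwl's or Clicky's actual figures.

```python
from statistics import mean


def rescale_desktop_only(desktop_shares, desktop_factor):
    """Scale desktop-only percentages (summing to 100) into an all-devices table row.

    Whatever is left after scaling is reported as 'Other'; any 'Other' key in
    the input is overwritten, since a desktop-only source has no such column.
    """
    scaled = {name: round(value * desktop_factor, 2) for name, value in desktop_shares.items()}
    scaled["Other"] = round(100.0 - sum(scaled.values()), 2)
    return scaled


# Hypothetical mobile shares (percent) from two reference sources.
mobile_share = mean([5.0, 7.0])
desktop_factor = 1 - mobile_share / 100   # roughly 0.94

# Hypothetical desktop-only figures that add up to 100%.
row = rescale_desktop_only({"Windows": 90.0, "Mac OS X": 9.0, "Linux": 1.0}, desktop_factor)
print(row)
```

The same factor, and its complement for the mobile figures, would be applied to a source that reports desktop and mobile separately, so both sources end up corrected in the same way.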

Vote Count

We're discussing two sources in particular, and they have different suggestions... Here's a summary of the votes we can see by reading the discussion (please update this if you add something to the discussion):

StatOwl - Let's remove it!

AT Internet - let's merge the "Others" cells!

AT Internet - let's count "Symbian" and "Blackberry" as "other"!

Apply desktop/mobile split (mean of Net Applications and StatCounter Global Stats figures) to both Clicky and StatOwl

  • Yes, apply to both - 3 votes - Harumphy, Jdm64, 195.23.92.1
  • No, apply to neither - 0 votes
  • Apply to Clicky but not StatOwl (status quo) - 0 votes

Linux table headings

For clarity and consistency between sections, I suggest we change the top-level heading in both the web client and mobile device tables from "Linux" and "Linux based" respectively to "Linux kernel based", and change the second-level heading in the web client table from "mainstream" to "Linux". --Harumphy (talk) 19:34, 11 May 2011 (UTC)[reply]

I don't think that's the best solution. For one, "Linux kernel based" is a long title. Second, I think it would be confusing. What's the difference between "Linux" and "Linux kernel based"? I understand what you're trying to say, but would others? I think it's fine how it is, or possibly, "Linux" as the top heading (or "Linux based") and then sub-headings of "GNU/Linux" and "Android/Linux". Jdm64 (talk) 22:14, 11 May 2011 (UTC)[reply]
Linux has two meanings: (1) the Linux kernel, and (2) the family of operating systems based around it, which are largely binary compatible with each other and traditionally known as Linux distributions. Then there is Android, which uses a forked Linux kernel, is binary incompatible with Linux distributions and has a stack sitting on the kernel which is very different from anything else. The only thing that Android has in common with Linux distributions is the kernel, and that is a heavily modified, incompatible derivative. I am aiming to better reflect the two meanings, and to deal with the fact that within a couple of months or so it looks as though Android will be more mainstream than the stuff we currently call "mainstream". As far as length goes, "Linux kernel based" will fit without expanding column width. (I've tried it.) I don't think we should use GNU/Linux or Android/Linux as they really are too long, don't reflect what the sources say and do not aid understanding at all. --Harumphy (talk) 08:05, 12 May 2011 (UTC)[reply]
I am more confused by "Linux kernel based" vs "Linux based" as they may be understood as synonyms and anything Linux based is certainly Linux kernel based. We must of course use terminology that reflects what the sources are talking about, but isn't most Linux except Android indeed GNU/Linux (which is not longer than "Linux based")? If there is significant use of other Linuces (affecting the decimal points we are writing out) simply "Android" and "Other Linux" should do. --LPfi (talk) 11:51, 12 May 2011 (UTC)[reply]

[section break]
Just to be clear, I'm suggesting this:

Linux kernel based
Linux Android

The top line is an umbrella heading that accurately reflects the only thing that Linux distributions and Android have in common: some sort of Linux kernel. In the second line, Linux means what it is most commonly understood to mean - a Linux distribution. In this I'm taking the view that Android is *not* a Linux distribution in the conventional sense because it has so little in common with Debian, Ubuntu, Fedora, RHEL, SuSE etc. All of the stats sources except Wikimedia separate Linux and Android in this way. --Harumphy (talk) 12:40, 12 May 2011 (UTC)[reply]

Like LPfi said, anything Linux based is surely Linux kernel based; this is like how Linux is a Unix-like OS. Your headings look redundant, especially to somebody that doesn't know about Linux, and they don't make somebody want to learn what the distinction is. I think the layout below clearly shows the distinction between normal Linux and Android. "Linux based" is a link to "Linux kernel". "GNU/Linux" could be 2 separate links to GNU and Linux or one link to Linux Distribution. How is that not simple and clear? Jdm64 (talk) 20:22, 12 May 2011 (UTC)[reply]
Linux Based
GNU/Linux Android
The phrase "Linux based" is no more informative than just "Linux", because it doesn't make clear which of the two things called Linux forms the base. It could mean either just the kernel or the kernel plus the stuff that makes a Linux distribution. So, to answer your question, it's not simple and clear because it's ambiguous. Sure, the kernel's always there, even in Android, but the other stuff isn't. By excluding the word kernel, it doesn't make it clear that Android is based on only the kernel and not the other stuff. The 'umbrella' heading should reflect what the things under it have in common. They have only one thing in common: the kernel. That is why the k-word is the key to comprehension here. --Harumphy (talk) 23:38, 12 May 2011 (UTC)[reply]
Ok, fine, include kernel. But that still doesn't remove the confusion about "Linux kernel based" and "Linux". It should be "GNU/Linux" to show how Linux kernel based is different than Linux. Jdm64 (talk) 01:24, 13 May 2011 (UTC)[reply]
Fair enough. Thanks. I'll settle for that. --Harumphy (talk) 07:13, 13 May 2011 (UTC)[reply]

I believe Android should be reclassified as a mobile device. See my comments there. hhhobbit (talk) 14:24, 5 June 2011 (UTC)[reply]

Count Amazon Kindle?

Amazon Kindle was reported to likely break 8 million units sold last year. http://www.slashgear.com/amazon-likely-to-break-8-million-kindle-units-sold-this-year-21120580/

With quite a few media being sold: http://news.cnet.com/amazon-kindle-books-outselling-all-print-books/8301-17938_105-20064302-1.html Better data is likely available. Seems these are significant numbers. --89.12.7.116 (talk) 20:57, 26 May 2011 (UTC)[reply]

This page is about usage share of operating systems, not devices. The OS that the Kindle uses is Linux, so if we were to add it, it would only be a small side note that the Kindle runs Linux. I think it's more appropriate that the information be added to Linux-based devices. Jdm64 (talk) 00:16, 27 May 2011 (UTC)[reply]

I have written this about thirty times and each time started over. I would like to do that again right now. Saying Kindle is Linux is like saying Mac iOS is OS-X, or OS-X is FreeBSD. Mac OS-X uses launchd to start everything. Except for a few things that init starts, init is basically something that all other processes have as their parent if they lose their immediate parent. launchd does not work the same way. Is OS-X's launchd the same thing as init in Unix / Linux? No. The same thing is occurring with these mobile OS. One mobile OS has the distinction of being derived from nothing but being its own little entity from the start - Blackberry. All the other mobile OS are diverging so far away from what they were derived from that the code base is becoming meaningless. iOS really is that different from OS-X. But each OS is really not just the kernel. It is all of the things that go together including the hardware that make up that system. Unless you want to have a separate category for each of these mobile OS I suggest you lump them all together with the category mobile OS. They have more similarities with each other than they do with what they were derived from. Apple has joined Windows in having malware that self installs now with no password required on Macintosh OS-X as long as the user account you are using has administrator privileges. It has the promise of continuing that way unless Apple finally wises up and begins requiring a password for software installs for all OS-X users. May I humbly suggest these malware problems are making a lot of people mobile OS only users? But you have been caught napping. Apple sold more iPhone and iPad systems in the last two quarters than they did OS-X. The malware problems with the predominant desktop systems combined with Twitter and other things are making many current desktop OS systems dinosaurs. So I suggest you have a separate mobile OS category with maybe a breakdown showing what each was derived from.
But the malware problems of the predominant desktop systems are rapidly making mobile OS the tour de force of the future. Would I have predicted that two short years ago? No. I was also caught napping. It is rapidly progressing toward a future where many people will be mobile OS only users, storing their data in the cloud (data storage repositories) and printing to new printers that use Bluetooth. Any general mobile OS that doesn't make provisions to share the data that was created on it with a different general mobile OS from another vendor will rapidly become a relic of the past. IMHO, your current classification scheme was what was there in the past and what we have now is becoming increasingly incongruent with what you have. You are missing what has been happening with these mobile devices. Mobile OS are rapidly becoming the OS of the future. The fact that 8 million Kindle units have been sold indicates that things are changing. Did we have eight million new installs of desktop Linux systems last year? No. Your percentages are woefully out of date, but mostly because your categorization is wrong. Kindle is not Linux. iOS is not Macintosh / OS-X. They are now separate entities with very little similarity to what they were derived from. hhhobbit (talk) 02:57, 6 June 2011 (UTC)[reply]

The problem is I still don't know where the data would fit on this page given the current sections. I'm not saying the information is unimportant, just not suited for this page. This page is still about OSs, and the OS of the Kindle is Linux kernel based. It's just not a traditional desktop distribution. Similarly iOS is based on the Darwin OS, just like MacOSX. Jdm64 (talk) 20:41, 6 June 2011 (UTC)[reply]
To me, the more important issue (and the answer is not clear to me re Kindle), is under what circumstances should a device's OS fit into this article. In some sense, every automobile with a computer chip has an OS in it (a real-time kernel of some kind), but I doubt that fits the intent of this article. Kindle has a Linux kernel. How much else of what we think of as "an OS" does Kindle have? If it weren't for Kindle's ability to browse the web, it would be a single-purpose dedicated device, not really different (IMHO) than the smarts in an automobile - or a microwave oven for that matter. My point is, where to draw the line? Again, I don't know the answer to that. Perhaps a section for devices that can't download apps. If Kindle is included, then so should the 300 M NON-smart phones sold last quarter be included. (a variety of proprietary "OS"s) ToolmakerSteve (talk) 04:37, 21 August 2011 (UTC)[reply]

Long Term Suggestions

I was looking at the discussion, and at the article. What struck me is that the current active discussion topics seem to be discussing the different facets of the same issue, and I think that we should look at combining them. Problem is that since you sometimes link to me, I can't work on the page :) I can however make suggestions. My apologies if the formatting is a bit rough. Formatting on discussion pages drives me to distraction sometimes.

  • Technology Types

I think we can limit things to three technology types:

Personal Computers - Desktops, Notebooks, Netbooks, Laptops, Nettops, in other words any stand alone computing device which is designed to be used by a single user and which has a full sized keyboard.
Mobile Devices - Tablets, EReaders, MP3 Players, Phones, in other words any stand alone computing device which is designed to be used by a single user and which, while it may have a keyboard, will not have a full sized one, or will have an on-screen keyboard. Optional Bluetooth or USB keyboards do not count, as they are not part of the basic device and it is designed to function without a keyboard.
Servers - in other words any computing device which is designed for multiple user use, either over a network, or through direct connection as was once common. Servers include all computing devices which are not stand alone such as Desktop Client units. Mainframes and Supercomputers are effectively specialized Servers.
  • Numbers

This gets fun. No matter what is done no one will be happy. I'd rather be too expansive here though. While it's difficult to be certain about reliability of any numbers below 5%, the fact that something shows up is of interest. Part of the problem is that everyone wants the numbers to favor them. This puts us in opposition to them, because we want the numbers to be accurate and favor no one.

Unless there is solid evidence that the numbers from a supplier are inaccurate we need to show them. If an analyst or investigator is able to come up with evidence that there is a problem we need to provide a link to it with a note that this supplier's numbers are questionable.

When we are displaying numbers, we need to make sure that the numbers are from the same time period. In the Server Usage Share we have dates of 2007, Jan. 2009, July 2009, September 2010, and Q1 2011. Dates this far apart are impossible to make a valid comparison with. We need to set a rule on age range allowed. My personal suggestion is that the widest range should be eighteen months. It might make the charts a lot smaller, but it will make them a lot more sensible.
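The proposed age-range rule can be sketched as follows. The date labels are the ones listed above for the Server Usage Share table; mapping "2007" to December 2007 and "Q1 2011" to March 2011 is an assumption made here (the most favourable reading for the oldest figure), and the function and variable names are invented for illustration.

```python
def months_apart(a, b):
    """Absolute distance in months between two (year, month) pairs."""
    return abs((a[0] - b[0]) * 12 + (a[1] - b[1]))


# Labels from the table; the (year, month) values for "2007" and "Q1 2011" are assumptions.
figure_dates = {
    "2007": (2007, 12),
    "Jan. 2009": (2009, 1),
    "July 2009": (2009, 7),
    "September 2010": (2010, 9),
    "Q1 2011": (2011, 3),
}

MAX_RANGE_MONTHS = 18  # the widest range suggested above

newest = max(figure_dates.values())
oldest = min(figure_dates.values())
span = months_apart(oldest, newest)

# Figures that would survive if everything must fall within 18 months of the newest one.
kept = [label for label, when in figure_dates.items()
        if months_apart(when, newest) <= MAX_RANGE_MONTHS]

print(f"span = {span} months; kept = {kept}")
```

Even on the favourable assumption for the 2007 figure, the current spread is more than twice the suggested limit, and only the two most recent figures would remain.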

Longest term we should probably consider splitting this into three articles, i.e. usage share for each technology type so that each type can be handled in far more detail. UrbanTerrorist (talk) 19:59, 12 August 2011 (UTC)[reply]

sales are not equivalent to use

Re "Moreover sales are not equivalent to use, as Windows comes pre-installed on many computers that will be used with other operating systems."

AFAIK, the actual PERCENTAGE of Windows computers that are wiped and replaced with Linux is small. I have NEVER heard any evidence, anecdotal or substantive, to counter that. Unless you have a SOURCE for that statement it should be re-worded. ToolmakerSteve (talk) 20:03, 22 August 2011 (UTC)[reply]

To balance the (IMHO overstated) emphasis on Windows sales not equating with usage, I've added a SOURCED reference mentioning PIRACY, which is a factor that increases usage above sales. (My interest is in making the best possible estimates comparing various desktop OS to smartphone OS unit usage. E.g. I want to know when Android passes Windows to become the #1 OS in units.) ToolmakerSteve (talk) 20:33, 22 August 2011 (UTC)[reply]

Apologies for pushing this point further, but I just noticed that the 1% Median web browsing statistic for GNU/Linux is also consistent with the hypothesis "the % of PCs on which Windows is replaced by Linux is statistically small." IMHO, it is a fraction of a percent - less than the statistical error of the available sources - having no significant impact on the total Linux percentage. Thus, the vague adjective "MANY" in ".. pre-installed on many computers that will be used with other .." is inappropriate. However, since I have not found a source, I will leave it to the author who added that sentence to reword it to be less misleading. I DO like the basic concept of pointing out to readers that there is a difference between sales and usage, so I DO favor keeping that sentence in some fashion; on the other hand, it is important not to overload/confuse/mislead the average reader with information that may be statistically minor. ToolmakerSteve (talk) 22:40, 22 August 2011 (UTC)[reply]

One possible proxy for Linux usage is downloads. It would be best to combine that with a survey that samples downloaders, to find what they are doing with their copy - if there is such a survey. E.g., I have a hard drive with an older version of Red Hat Linux on it. Not currently installed in a machine. Some fraction of Linux downloads are in dual boot setups with Windows - would be interesting to have users estimate how much time they spend in each OS. To distinguish between hobbyists experimenting with it occasionally versus substantive use. ToolmakerSteve (talk) 00:51, 23 August 2011 (UTC)[reply]

I added a source having an alternate analysis of ~ 6% for Linux in 2009, and showed that such analysis would yield an alternate Q2 2011 Linux figure of ~ 5 million. However, making that extrapolation might qualify as "original research", hence dubious. I've e-mailed the author requesting any updated figures/links. ToolmakerSteve (talk) 03:00, 23 August 2011 (UTC)[reply]

Please see the discussion at the top of this page about the C. Martin blog piece. In the light of this I've reverted this one edit.--Harumphy (talk) 10:52, 23 August 2011 (UTC)[reply]
Thanks, I had missed that discussion. I also have since learned that even if 6% had been credible momentarily in 2009, due to Netbook sales, the extrapolation to today would not be valid. In 2009, there may have been a period where Linux was more strongly selling on Netbooks, e.g. by Dell. Microsoft responded with low price for Windows 7 Starter on limited hardware (e.g. 1 GB RAM), and has successfully turned vendors such as Dell back into near-100% sellers of Windows. I find no significant support for the notion that anyone other than the rare highly technical user would choose Linux (for a PC), given Windows available at negligible cost. (Quite the contrary, there is anecdotal evidence that a more common action, when a Windows license adds significantly to a computer's cost, but is optional, is to purchase the computer with a free OS, and then replace that with a pirated copy of Windows.) ToolmakerSteve (talk) 11:00, 23 August 2011 (UTC)[reply]
After searching to see what reports are available for different OS segments, it occurs to me there might be a simple explanation as to why there aren't more available numbers for Linux sales on PCs, from the various research companies: maybe the numbers aren't worth reporting on. Why do I say this? Because it is notable that numbers for Linux server sales are readily available. (Granted, server sales are lower volume and higher dollar, so easier to track.) If Linux were making significant inroads in general PC use, more companies would deem that worth researching. I consider this additional indirect evidence that Linux sales volumes on general PCs continue to be < 5%. ToolmakerSteve (talk) 12:19, 23 August 2011 (UTC)[reply]

After further thought about reasons that might cause significant numbers of people to go to the effort of replacing an OS, and the POV tone of the sentence under discussion, I have replaced it with the following attempt at a neutral statement: "Also, sales may overstate usage. Most computers are sold with a pre-installed OS; some users replace that OS with a different one, perhaps for security reasons, or to install an OS for which more applications are available.[citation needed]" ToolmakerSteve (talk) 20:45, 23 August 2011 (UTC)[reply]

This is anecdotal, but I have personally replaced Windows with Linux on about forty or fifty computers. I know a lot of people who have done that on more computers than I have. So the number of computers running Linux could be far different from what the analysts are estimating. The issue is getting reliable numbers, and to the best of my knowledge, no one has proposed a method that seems likely to be reliable. UrbanTerrorist (talk) 20:07, 16 October 2011 (UTC)[reply]

That is interesting information. Would love to see some survey that indicates how widespread that is. Both in business use, and in home use. Is this being done by people with IT background? Primary reasons for doing so? ToolmakerSteve (talk) 02:36, 11 November 2011 (UTC)[reply]

LOL that graphic

That graphic on the right is not a good member of this page. It is created from impossible-to-identify data, and its citation is the page it is on. Come on, people. Someone can do better than this. Forcep caliper (talk) 03:05, 30 September 2011 (UTC)[reply]

I agree. Referencing itself seems awkward. And giving usage shares for "web client operating systems" without saying what those are is even more confusing. — Preceding unsigned comment added by NotDifficult (talk • contribs) 07:45, 28 October 2011 (UTC)[reply]
In what way is the data impossible to identify? It's the median of eight sources, all of which are cited. So the figures can be verified precisely. What's the problem?--Harumphy (talk) 13:15, 28 October 2011 (UTC)[reply]

Ubuntu

"Clicky Web Analytics, StatOwl and Wikimedia indicate that Ubuntu has an order of magnitude more usage than any other identified desktop Linux distribution."

So does this mean that Ubuntu should get its own column in the chart?--Harizotoh9 (talk) 07:36, 25 October 2011 (UTC)[reply]

Somewhere in this talk page or its archives there is some discussion of what the threshold should be for including an OS in the web client table. For some time now the consensus has been that an OS only gets its own column if it's identified by more than half the sources. Ubuntu is identified by three of the eight, but the consensus requires five out of eight. About four other Linux distros (IIRR Debian, RedHat, Fedora, SuSE) are also identified by three sources FWIW. --Harumphy (talk) 13:52, 25 October 2011 (UTC)[reply]
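The threshold described above can be sketched in a few lines of Python. This is only an illustration of the "more than half the sources" rule as stated in this thread; the function name `gets_own_column` is hypothetical and is not taken from the article or from any policy page.

```python
def gets_own_column(identifying_sources: int, total_sources: int) -> bool:
    """Return True if an OS is identified by strictly more than half the sources.

    Hypothetical helper illustrating the consensus rule described in the thread.
    """
    if total_sources <= 0:
        raise ValueError("total_sources must be positive")
    return identifying_sources > total_sources / 2


# With eight sources, three identifications fall short and five are enough.
ubuntu_qualifies = gets_own_column(3, 8)
five_of_eight_qualifies = gets_own_column(5, 8)
```

Under this reading exactly half (four of eight) would not qualify, which matches the statement that five out of eight are required.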
We should also consider the practical side of inclusion. The horizontal space of the page is not infinite. 1exec1 (talk) 18:20, 29 October 2011 (UTC)[reply]

Median constitutes improper synthesis and original research

I first marked it as original research - a marking which was promptly deleted. I deleted the section and it was promptly reverted. The argument still stands: median is not an acceptable calculation:

  • While "well defined" it constitutes improper synthesis of the numbers it calculates over. It reaches a conclusion not supported by any of the sources. Read WP:OR. It does not correctly reflect the sources.
  • Wikipedia policy requires consensus even for routine calculations like totals and counts. I marked it as WP:OR - a marking which should not be summarily deleted as was done by Harumphy.
  • Median is by no means a routine calculation; it is a statistical method which is not applicable in this setting: The result will be highly dependent on which sources are selected, i.e. the numbers are a result of article editing (specifically source selection) and not attributable to a source.

User Harumphy has threatened to treat it as edit warring if I remove the line again. However, my position stands: This is original research and it does not belong here. There is not consensus, so Harumphy, please remove that line yourself. Useerup (talk) 11:16, 30 October 2011 (UTC)[reply]

If the median was being used to infer something then it might constitute improper synthesis. But it isn't being used for that. It's just being stated as a median without any conclusion being drawn from it. The most relevant part of WP:OR is surely WP:OR#Routine_calculations, and there has to date been a consensus among editors here that it's OK as far as that policy is concerned.--Harumphy (talk) 11:29, 30 October 2011 (UTC)[reply]
Sure, if you delete the WP:OR markings you can claim consensus. Median is certainly not a routine calculation. Median, mean etc are original research and improper synthesis because the result is not supported by any of the sources. You are creating a synthesis over a number of sources. This is wrong on many levels, not least that the result will depend heavily on the sources selected. To use the median you need a source which calculated that median and which supports why a median is proper. There is no such source referenced, hence OR. Useerup (talk) 11:46, 30 October 2011 (UTC)[reply]
Reading archived discussions I don't see a discussion with a consensus at all. I see someone touched upon the subject by discussing the mean value - but no discussion and "consensus" on the applicability of median at all. But that really doesn't matter, as there is no consensus at this point. Useerup (talk) 12:23, 30 October 2011 (UTC)[reply]
To make matters worse, the median is supposed to "remove outliers" (per archived discussion). But the numbers do not at all express the same distributions. Some are demographically biased, others are openly geographically biased. Calculating a mean or a median (or any other statistical function) is not just OR - it is totally improper as it lumps together apples and oranges. Useerup (talk) 12:23, 30 October 2011 (UTC)[reply]
A few points in reply:
  • I agree that any past consensus becomes moot if there isn't consensus now.
  • I ask that given that the table's format has been stable for some time, it shouldn't be altered until a new consensus has been reached here first.
  • I disagree with your assertion that median is OR. In what way is a median less of a routine calculation than, say, the simple addition that is specifically endorsed by WP:OR#Routine_calculations? After all, they are both just forms of y=f(x1 ... xn). (If you know the values of x then there's only one possible value of y whether it's a median, mean or simple addition). You keep asserting that it's OR but ISTM that (a) you have not yet justified that assertion, and (b) even if it is, it's an allowable form of it. AFAICS what we're doing is entirely consistent with WP:OR#Routine_calculations. If you disagree, please explain why, don't just baldly assert your opinion as fact. --Harumphy (talk) 12:53, 30 October 2011 (UTC)[reply]
Median is not a routine calculation like a simple conversion between units of measure (inches to meters, birth date to age etc). It is a statistical function which is applicable in certain situations and not in others. For simple/routine calculations this is uncontroversial: you cannot argue that the conversion from feet to meters introduces new knowledge or is open to interpretation. For statistical functions you are making assumptions and creating synthesis. I have explained why above: You are using median across a data set with very, very different numbers: Some numbers have an express geographical bias, others have an open demographic bias. Median is as wrong as mean in those situations. If I introduce yet another stat counter (or remove one) it will immediately change the median number. Thus, the selection of sources becomes a basis for the calculated median. That selection is performed by wikipedia editors and has no basis in any of the sources. The policy is pretty clear, you cannot combine multiple sources to reach a conclusion not expressly supported by any one of the sources. Useerup (talk) 13:19, 30 October 2011 (UTC)[reply]
The median isn't a 'conclusion'. It's just a summary. A summary inevitably compromises precision in the pursuit of brevity. That doesn't render it invalid, or OR. What does anyone else think?--Harumphy (talk) 14:45, 30 October 2011 (UTC)[reply]
I think that saying that median is not a routine calculation is itself OR, thus that assertion itself needs proper discussion before we can discuss its applicability here. Seriously, it has been discussed here already, and since the current table doesn't clearly violate any of Wikipedia's policies, consensus has higher authority than anything else. 1exec1 (talk) 15:23, 30 October 2011 (UTC)[reply]
Going directly by WP:NOR:
  • Do not combine material from multiple sources to reach or imply a conclusion not explicitly stated by any of the sources. If one reliable source says A, and another reliable source says B, do not join A and B together to imply a conclusion C that is not mentioned by either of the sources. This would be a synthesis of published material to advance a new position, which is original research.
    • Here we have a table of 8 sources which say A B C D E F G H. This article joins all of those to imply conclusion M which is not mentioned by any of the sources. Please explain how that is not OR?
    • The use of median (even if OR was allowed) in this case seems highly doubtful. A median is computed over a homogeneous set of numbers expressing the same property for a number of observations, e.g. ages of students in a class. The problems here:
      • the median here is not computed over the same property: One number is the web usage for mostly German sites, another "mostly" in U.S., a third "mostly" web designers and other self-selected communities. So what does the median represent? U.S.? Global? Germans? Coffee-drinkers? It makes no sense. It is the median of seeds in boxes of fruits. Some boxes with apples, some with oranges, some with rotten bananas.
      • a median is only valid when computed over a complete set of observations. The number of students in a class is well-defined, countable, verifiable and finite. The number of web client usage share counters is uncountable and the selection here has been selected by editors.
  • This policy allows routine mathematical calculations, such as adding numbers, converting units, or calculating a person's age
    • These are simple mathematical arithmetic calculations and conversions; a far cry from statistical calculations. The closest you can get to mean or median (and that would be a stretch) is "adding numbers". However, this article not just adds numbers, it calculates a median over numbers from multiple sources, thus the calculation is sensitive to the sources chosen by wikipedia editors, and thus is not supported by the sources. The sources may individually be reliable (with the caveats for each one) but the list has been compiled by WP editors and median calculated over that list. This is clearly new conclusions entered by WP editors.
    • It is even worse: At least 2 of the sources have been "corrected" by WP editors (in good faith, but still) further creating OR.
Claiming that "saying that median is not routine calculation is itself OR" is... strange. This is the talk page and not the Bizarro universe where everything is opposite. As everywhere on wikipedia the burden falls on the editor who wants to enter (or keep) a claim to demonstrate that it is not original research, see WP:VERIFY. To demand that anyone challenging a claim must first demonstrate that such a challenge itself is not OR is a novel take. Useerup (talk) 16:54, 30 October 2011 (UTC)[reply]
You keep asserting that a conclusion is reached / implied by calculating median. It is not. This is explicitly stated in the article, along with the caveats regarding the accuracy and data skewing. Median is just that - a median of that data, no conclusion is implied anywhere that refers to the calculated median for support. Thus the SYNTH point is weak. 1exec1 (talk) 18:33, 30 October 2011 (UTC)[reply]
The bolded line with median is a conclusion. It states "this is the usage share". In the archived discussion the median is even pushed as a way to do away with "outliers". Useerup (talk) 19:27, 30 October 2011 (UTC)[reply]
The bolded line doesn't state it. It's just you inferring it. Thus the only error is in your own perception.--Harumphy (talk) 23:03, 30 October 2011 (UTC)[reply]
BTW your 'improper synthesis' tag on two of the table's footnotes is wrong so I'm removing it. This is a routine calculation that was unanimously approved by the three editors who discussed and voted on this very issue, and thus compliant with WP:OR (See the last vote in Talk:Usage_share_of_operating_systems#Vote_Count above.)--Harumphy (talk) 23:29, 30 October 2011 (UTC)[reply]
WP:NOTYOURS and WP:NOTDEMOCRACY. Issue stands - "correcting" numbers is improper synthesis. I apologize for being so blunt, but please don't remove tags before issue has been resolved. Useerup (talk) 01:09, 31 October 2011 (UTC)[reply]
So what exactly does the bolded median line state? Does it compute over the other rows (multiple sources)? Is it supported by any one of the sources? Are the numbers in that row attributable to reliable secondary sources or are they the result of wikipedia editing, i.e. selection of sources? Do any of the sources coalesce different demographics or different geographical regions and explain how median is a safe method? Useerup (talk) 01:20, 31 October 2011 (UTC)[reply]
It states what it says it states: the median. It states nothing else. That median is a routine calculation which has been approved by editors and is thus fully compliant with WP:OR#Routine_calculations. This policy is a specific exemption to the requirement for an external source: as long as the input figures are cited (and they are, right there in the column) the output of the calculation requires no external source. As far as selection of sources goes, we've included every remotely-credible source we know of that tracks multiple web sites. The sources do undoubtedly have various demographic, regional, linguistic and other biases, but that doesn't matter in the context of this dispute. (IMHO, the median is interesting because it probably helps to mitigate against such biases, but any such mitigation is not claimed either in the article or as a justification here.) Median is a 'safe method' for calculating a median. For that purpose, the only one claimed, it's 100% safe. I know of none safer.--Harumphy (talk) 08:36, 31 October 2011 (UTC)[reply]
Why is the median relevant here at all then? Why not an average? The WP:OR#Routine_calculations policy is for routine calculations based on a number from a single source, like calculating a person's age from a birthday or adding Windows version numbers to create a total for Windows. The WP:OR policy specifically prohibits creating a synthesis from multiple sources. This median is exactly that, a number calculated from multiple sources. And you even have opaque selection criteria for those sources. And it is not clear at all how you calculate that median; the numbers in the median row end up not even being comparable to each other (the row does not come to 100%). And the numbers used for the median calculation have even been "corrected" by editors as well. This is wrong on so many levels, but I will try to summarize them below. Useerup (talk) 10:10, 31 October 2011 (UTC)[reply]
In reference to "Median is a 'safe method' for calculating a median. For that purpose, the only one claimed, it's 100% safe. I know of none safer": I know a 100% safe way to calculate an average. Let's go add that to the table, shall we? It doesn't claim to be anything but an average, it's not a conclusion or anything; just an average. I also know a 100% safe way to calculate the product of all usage shares. Since it doesn't claim to be anything but a product of usage shares, we can add (multiply, rather) that as well. Let's throw in the sum as well; it doesn't claim to be anything but a sum. Useerup (talk) 10:29, 31 October 2011 (UTC)[reply]
For what it's worth - I originally objected to the median in the summary table on the same grounds as User:Useerup raises now. At the time, there was indeed a majority of editors who thought it was not OR. So User:Harumphy is right - there was consensus. And while I still think it is OR, I also agree with User:Harumphy that any change to these tables or the calculation would require a new vote. Wikiolap (talk) 02:42, 31 October 2011 (UTC)[reply]
WP:NOTDEMOCRACY Useerup (talk) 10:10, 31 October 2011 (UTC)[reply]

Whether the median is OR or not is beside the point. It's just unimportant. Jasper Deng (talk) 21:50, 31 October 2011 (UTC)[reply]

Summary of issues with median row of the usage share table

RFC: Median calculation across multiple sources

The article (and a related article, Usage share of web browsers) routinely calculates the median over operating system usage statistics from multiple sources. The article also uses the median numbers as the basis for a usage share graph. An editor has raised concerns that

  • median may not be a routine calculation (and thus original research)
  • the specific application of a median creates a synthesis over multiple sources
  • the specific application of a median is not statistically safe as the sources have known (and declared in article) geographical and demographical biases
  • the sources over which the median is calculated are selected by Wikipedia editors, the selection not being supported by any source
  • some of the sources have been adjusted by Wikipedia editors to allow for the cross-source calculation because the sources do not break out the numbers in the same way (mobile usage share).
  • medians calculated for operating systems individually yield a set of usage share percentages the total of which may exceed 100%

Some of these concerns have been discussed before, and editors have previously through a vote decided that median is applicable and not WP:OR. --Useerup (talk) 09:30, 6 November 2011 (UTC)[reply]
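The last concern in the list above (per-system medians that do not total 100%) can be illustrated with a small Python sketch. All figures and the names `source_1`, `os_a` and so on are invented for illustration and are not taken from the article's table; each hypothetical source totals exactly 100%, yet the row of column medians does not.

```python
from statistics import median

# Hypothetical usage shares (percent); every source row sums to 100.
table = {
    "source_1": {"os_a": 90, "os_b": 8, "os_c": 2},
    "source_2": {"os_a": 88, "os_b": 4, "os_c": 8},
    "source_3": {"os_a": 86, "os_b": 7, "os_c": 7},
}
systems = ["os_a", "os_b", "os_c"]

# Each individual source is internally consistent.
row_totals = {name: sum(row.values()) for name, row in table.items()}

# The median is taken per column, i.e. per operating system across sources.
column_medians = {
    os_name: median(row[os_name] for row in table.values())
    for os_name in systems
}

# The medians come from different sources, so nothing forces them to sum to 100.
median_total = sum(column_medians.values())
```

With these made-up numbers the medians are 88, 7 and 7, which add up to 102. Other inputs could just as easily give a total below 100.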


Because the debate above quickly focused on single, specific sub-issues (like that the median doesn't represent a conclusion but is just a median), I felt it necessary to create this summary section to make sure that what I consider the main issues are being addressed through the debate. --Useerup (talk) 09:30, 6 November 2011 (UTC)[reply]

For those who came here through RfC: please also read the previous discussion (#Median constitutes improper synthesis and original research), since there are already many arguments raised, some of which might not be repeated here. 1exec1 (talk) 14:58, 6 November 2011 (UTC)[reply]
I proposed a change to WP:CALC here. I think there is an issue with the policy itself. Please discuss. Thanks! 1exec1 (talk) 14:31, 7 November 2011 (UTC)[reply]

Median is not a routine calculation

Unlike simple arithmetic functions and conversions, the median is a statistics function and as such assumes a number of properties about the set over which it is used. So while it is well-defined it is by no means simple to apply appropriately (routinely), and consequently is not covered by wp:Or#Routine_calculations. There is a very big difference between calculating the age of a person from his birthday and creating a web usage share by calculating medians. The former is uncontroversial, the latter is not. Useerup (talk) 10:12, 31 October 2011 (UTC)[reply]

Is this opinion based on anything? Dmitrij D. Czarkoff (talk) 18:58, 7 November 2011 (UTC)[reply]
Yes. Read above. Useerup (talk) 20:37, 8 November 2011 (UTC)[reply]
I've read above that You find it difficult to calculate median. I wanted You to state, what in the process of calculation was so difficult for You? I asked it in more detail in another thread of this discussion, but You don't reply there, so I asked in this thread. — Dmitrij D. Czarkoff (talk) 20:51, 8 November 2011 (UTC)[reply]

I don't understand how computing the median of a few numbers could be considered anything but one of the most routine calculations. I've frequently seen it done quickly and correctly by people whose grasp of mathematics is like that of a 10-year-old. A suggestion it is anything but routine without giving some reason why the particular case is somehow less routine than other instances of computing medians should be dismissed as absurd. Michael Hardy (talk) 15:37, 10 November 2011 (UTC)[reply]

I don't think the policy means that the calculation itself is simple, more that the reasoning why the calculation is done is simple and a person would straightforwardly think of doing it. When a person calculates miles per gallon that is in some ways a more complicated calculation as it involves division, but it is still much simpler in the sense of the policy as it is an obvious thing both to calculate and to choose to do and is done routinely in the appropriate circumstances. Dmcq (talk) 17:19, 10 November 2011 (UTC)[reply]

The numbers are not homogeneous

In this case the set consists of numbers expressing usage share of a selection of German-language sites, usage share of commercial sites in the U.S., usage share of web-designer-oriented sites etc. These are not homogeneous and treating them as such is inappropriate. The median of Linux usage share risks ending up being a mean between the usage share in Germany and usage share with web designers. Useerup (talk) 10:12, 31 October 2011 (UTC)[reply]

The more diverse the user groups involved, the more accurate the information is. Does being more accurate violate any Wikipedia policy? Dmitrij D. Czarkoff (talk) 19:00, 7 November 2011 (UTC)[reply]
Accurate? accurate??? I could create a whole list of native European language oriented stat counters, and that would create a heavy bias towards usage share in Europe (many diverse languages compared to, say, North America or even South America). How does usage share in China factor in? Accurate towards what? World share? How are the stat counters weighted then, considering that some very populous regions may be left out? The idea of "summarizing" these numbers is as far-fetched as apples and oranges. More observations only improve a sampling when the samples are selected from a homogeneous set. That is not the case here. These stat counters were selected from what was "available", but they express wildly different shares (geographically, demographically and sampling methodology: unique users or page impressions). Useerup (talk) 20:47, 8 November 2011 (UTC)[reply]
The more data You add, the more accurate the result is. Any statistical study is somehow biased, and that's why this article needs the median, because it is a statistical instrument to decrease bias as much as possible with no WP:OR.
So You have a simple choice: either ask for AfD or stop complaining.
Dmitrij D. Czarkoff (talk) 20:56, 8 November 2011 (UTC)[reply]

The selection is opaque and controlled by Wikipedia editors

The median picks the middle value (or the mean of the two middle values). Adding or removing a statistics counter will shift the numbers. The statistics counters have been picked by editors and the list is not exhaustive (where's China, for instance, or accounting-oriented sites?). The selection criteria are not clear and certainly not supported by any source. Even if criteria existed, qualifying the sources would in itself constitute original research. Useerup (talk) 10:12, 31 October 2011 (UTC)[reply]
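The sensitivity described above can be shown with a short Python sketch. The counter names and figures are hypothetical, invented for illustration, and are not the values in the article's table; the sketch only shows that the median of the same list changes when one counter is added or removed.

```python
from statistics import median

# Hypothetical Linux shares (percent) from eight counters; invented figures.
shares = {
    "counter_a": 0.7,
    "counter_b": 0.9,
    "counter_c": 1.0,
    "counter_d": 1.1,
    "counter_e": 1.5,
    "counter_f": 1.6,
    "counter_g": 1.8,
    "counter_h": 2.1,
}

# Even count: the median is the mean of the two middle values (1.1 and 1.5).
baseline = median(shares.values())

# Adding a ninth, equally hypothetical counter moves the median to 1.5.
with_extra = median(list(shares.values()) + [2.5])

# Removing the lowest counter also moves it to 1.5.
without_lowest = median(v for name, v in shares.items() if name != "counter_a")
```

With these made-up inputs the baseline is about 1.3, while either change to the list gives 1.5. How large the shift is in practice depends entirely on the real figures.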

Doesn't this raise questions regarding the table itself, and not the median? 89.180.27.148 (talk) 02:47, 4 November 2011 (UTC)[reply]
No, not in itself. Each source has been named, referenced and the potential bias has been explained. The reader can thus make his own judgement. The article does not infer anything. But when calculating the median, the editors implicitly claim a number of things:
  • That the sources are comparable in the first place (same demographics, cultures) since the median should only ever be used for comparable numbers. Think about weights, for instance. The table has a source which is heavily biased for German language sites. How many people frequent those sites, compared to commercial sites in the US? What then does the median calculated over those two express? Market share in the world? Market share in US? Market share in Antarctica? Now also factor in that some of the sources use unique visitors and others use page hits.
  • That the list is representative under an objective criterion. This is important because any addition will shift the median numbers. This list very well may be exhaustive, but that is not verifiable. There is no source that we can refer to which states that the statistic counters of this list are representative.
Useerup (talk) 06:25, 4 November 2011 (UTC)[reply]
The proper resolution here would be to add the sources You consider important here. And remove the WP:OR notice, since there is nothing original in it. — Dmitrij D. Czarkoff (talk) 11:10, 7 November 2011 (UTC)[reply]
The issue is not which sources I consider important. This concern is that the number being presented as "the median usage share" (even used in a graph) of a given operating system is not directly supported by any of the sources. Rather, that number is a synthesis from a list which will at all times be composed by the editors. I could question why a German-language oriented stat counter is in there. Depending on the outcome of that discussion the stat counter will be included or not. That in itself is not WP:OR. What IMO is synthesis is when those editor choices are reflected in a number being presented in the graph as well as "the usage share" of an operating system. That claim is not - and cannot be - supported by any of the sources. Useerup (talk) 15:12, 7 November 2011 (UTC)[reply]
It seems You fail to see the difference between cited value and directly supported value. The median number is synthesis at the same degree as each and every word in Wikipedia — they all are not literally taken from somewhere; instead they are the result of synthesis of all the sources. Median values are not special in this regard. — Dmitrij D. Czarkoff (talk) 15:24, 7 November 2011 (UTC)[reply]
No, the median is synthesis at the same degree as combining multiple sources in a way not foreseen and not supported by those sources, which is expressly forbidden. When the observations over which the median is calculated are under the control of editors it is not supported by any source. The simple calculation may be trivially verifiable, but the selection (or the criteria) of the list over which the median or average is calculated is not verifiable. That alone makes it OR. Useerup (talk) 16:42, 7 November 2011 (UTC)[reply]
What are You talking about? The median is the most trivial statistical instrument with absolutely no discretion on the editor's side. It is trivially verifiable. The verification method:
  1. exclude the maximum and minimum value pairs until no more than two values remain;
  2. if there are two remaining values, sum them up and divide by two;
  3. the result should be checked against the value stated in the article.
Now, please tell me, which part of the instruction above You can't accomplish and why? Dmitrij D. Czarkoff (talk) 18:32, 7 November 2011 (UTC)[reply]
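The verification steps listed above can be sketched in Python as follows. This is a minimal illustration, not a description of how the article's editors actually computed the row: the function name `median_by_trimming` and the share figures are hypothetical, and the result is cross-checked against `statistics.median` from the standard library.

```python
from statistics import median


def median_by_trimming(values):
    """Median by repeatedly dropping the smallest and largest value.

    Follows the three steps in the comment above: strip min/max pairs until
    at most two values remain, then average whatever is left.
    """
    remaining = sorted(values)
    if not remaining:
        raise ValueError("need at least one value")
    while len(remaining) > 2:
        remaining = remaining[1:-1]  # drop one minimum and one maximum
    return sum(remaining) / len(remaining)


# Hypothetical shares (percent) from eight sources; not the article's data.
shares = [0.8, 1.0, 1.1, 1.2, 1.4, 1.6, 1.9, 2.3]
trimmed_result = median_by_trimming(shares)

# Step 3: compare with an independently computed value.
matches_library = trimmed_result == median(shares)
```

The sketch only checks the arithmetic. It says nothing about whether the inputs are comparable, which is the point under dispute in this thread.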
The problem is in the sampling. There is really no need to be derogatory and imply that I don't know how to compute a median. I can compute the median of total fruits observed on orange- and apple trees. I really can. But, does the median then say anything about fruits per tree, oranges per tree or apples per tree? These observations are heterogeneous. Page hits and unique users. German language sites and US commercial sites. Small oranges, big oranges, red apples and green apples. Useerup (talk) 21:30, 8 November 2011 (UTC)[reply]
It is the way the statistical values are calculated: You take as many results as You can regarding the base and count them. The median represents the best one can get from the referenced material.
In effect, the referenced material as such exists without Wikipedia and each of its members isn't notable on its own, so either we somehow sum it up here (calculate median) or we just AfD the page as per WP:N. Just that simple. As the topic itself is notable, the editors did their best to give the most valuable data it is possible to derive from sources without WP:OR. In form of example: the median of oranges per tree and apples per tree gives more information about the amount of fruits per tree than the raw data itself. — Dmitrij D. Czarkoff (talk) 21:43, 8 November 2011 (UTC)[reply]

The median calculation combines multiple sources

The median calculation combines numbers from multiple sources. This is in direct contradiction to WP:Or#Synthesis_of_published_material_that_advances_a_position. The position being advanced is the idea of quantifiable operating system market share. This position is not supported by any of the sources. Useerup (talk) 10:12, 31 October 2011 (UTC)[reply]

The market share numbers have been "corrected" by wikipedia editors

The numbers used in the calculation have themselves been "corrected" because they did not factor in the same way. This is improper WP:Synthesis in itself, but it makes the median even more original research/synthesis. Useerup (talk) 10:12, 31 October 2011 (UTC)[reply]

Doesn't this raise questions regarding the table itself, and not the median? 89.180.27.148 (talk) 02:46, 4 November 2011 (UTC)[reply]
Yes, the numbers in the table should not have been "corrected". That is clearly improper synthesis where an editor guesses at how to compensate for lack of statistics for mobile units. My suspicion is that this was considered necessary in order to be able to compute the median. If you don't try to arrive at a single number, we could just make a note for the row in the table (and the mobile cells) that this is the case. Useerup (talk) 14:22, 4 November 2011 (UTC)[reply]
What "correction" exactly You believe to be improper synthesis? — Dmitrij D. Czarkoff (talk) 16:42, 8 November 2011 (UTC)[reply]
Relevant numbers are marked with improper synthesis?. Useerup (talk) 18:19, 8 November 2011 (UTC)[reply]
If You disagree with presumed results of Desktop/Mobile split, You have my full support here. But Your RfC was about Median; how do these issues relate? — Dmitrij D. Czarkoff (talk) 18:31, 8 November 2011 (UTC)[reply]
The median uses the "corrected" numbers as input. This is just one of the many issues with the median. Useerup (talk) 20:49, 8 November 2011 (UTC)[reply]
This is not an issue with median. This is an issue with table. Don't You notice the difference? — Dmitrij D. Czarkoff (talk) 21:07, 8 November 2011 (UTC)[reply]
Suppose that we now remove this correction (mobile split) and the usage share for the other (non-mobile) OSes are reset back to what the sources state (desktop shares increases). Now you have the median calculated across multiple sources where two sources have higher observed relative shares for desktop OSes compared to the others. I suppose the median magically erases that error source when it is calculated across the multiple sources, some with mobile split and some without? Useerup (talk) 21:38, 8 November 2011 (UTC)[reply]

Median is inappropriate even if the numbers were homogeneous

The median yields a line where the total of the usage share does not even come to 100%. Because the median of each column is calculated in isolation, the numbers of the median line end up not being comparable to each other. The problem is compounded (and further illustrated) by the fact that calculating the median on the "Median / Windows All versions" column will yield different results depending on whether you calculate the median for each version and sum the medians or calculate the median of the "Windows all versions" in isolation. So not only are the numbers volatile with respect to the selection, they will also yield different results based on how editors factor versions, and the numbers still end up not being comparable to each other. Useerup (talk) 10:10, 31 October 2011 (UTC)[reply]
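The arithmetic behind the point above can be sketched with made-up numbers. The three sources and their shares below are hypothetical, chosen only so that every row sums to 100%; they are not taken from the article's table.

```python
from statistics import median

# Hypothetical usage shares (percent) from three invented sources.
# Each row sums to exactly 100.
sources = {
    "Source A": {"Windows 7": 30, "Windows XP": 40, "Other": 30},
    "Source B": {"Windows 7": 40, "Windows XP": 30, "Other": 30},
    "Source C": {"Windows 7": 35, "Windows XP": 45, "Other": 20},
}
columns = ["Windows 7", "Windows XP", "Other"]
windows_versions = ["Windows 7", "Windows XP"]

# Median of each column, computed in isolation (as in a "Median" row).
column_medians = {
    col: median(row[col] for row in sources.values()) for col in columns
}

# The per-column medians need not add up to 100.
sum_of_column_medians = sum(column_medians.values())

# Two ways of arriving at a median for "Windows, all versions".
sum_of_version_medians = sum(column_medians[v] for v in windows_versions)
median_of_windows_totals = median(
    sum(row[v] for v in windows_versions) for row in sources.values()
)

print("column medians:", column_medians)
print("sum of column medians:", sum_of_column_medians)
print("sum of per-version medians:", sum_of_version_medians)
print("median of per-source Windows totals:", median_of_windows_totals)
```

With these invented figures the median row totals more than 100, and the two routes to a Windows figure disagree, even though every individual source is internally consistent.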

Is this the reason that a bar chart is used where a pie chart would be more appropriate? Because a pie chart is not possible without "correcting" over the others column, and because the sum can actually exceed 100%? Useerup (talk) 22:55, 1 November 2011 (UTC)[reply]
The pie chart can be used with any data. One can construct a pie chart on raw numbers. The most probable reason for a bar chart is that it is more illustrative. Dmitrij D. Czarkoff (talk) 19:04, 7 November 2011 (UTC)[reply]
Regarding the switch to bar chart from pie chart, this brief discussion may have been relevant. It was a discussion about the charts on another article, but there are enough editors who work on both that the logic may have been carried across. Certainly the way a set of medians work out was part of the thought process. --Nigelj (talk) 19:21, 7 November 2011 (UTC)[reply]
Thanks for that insight. I believe that the graph should display a stacked bar chart normalized to 100% instead of a simple bar chart. Each bar could illustrate each source. That way there wouldn't be a need to summarize the numbers, the sources can be clearly illustrated and the reader can seek out the explanation for deviating usage shares between sources in the text/table. The problem of breaking out Windows versions against OSes with no version break-out could then be easily solved by indicating the family relationship between Windows versions through coloring or patterns Useerup (talk) 20:08, 7 November 2011 (UTC)[reply]
You are wrong, a pie chart is not appropriate for "any kind of data". Sure you can put anything into a pie chart, but pie charts inherently express shares of a whole. The pie chart would expose the problem with the median numbers: They are not comparable. The bar chart is inappropriate because it misrepresents the data as it breaks out the Windows versions in separate columns but keeps the other OSes' versions in summarized columns. Bars of a bar chart do not indicate share of anything like a pie chart does - the bars convey the idea of absolute numbers where one bar can increase without the others needing to decrease. There are several graph types which are appropriate for shares: Pie charts, area charts or stacked bar charts; a bar chart is not. However, creating a pie chart with usage shares where the others is labeled as 3% (the median of others) but takes up an 8% slice of the chart would be openly dishonest. It does go to show how the median is meaningless. Useerup (talk) 19:20, 7 November 2011 (UTC)[reply]
So what is this? ;-) — Dmitrij D. Czarkoff (talk) 20:44, 7 November 2011 (UTC)[reply]
I see a pie chart with no labels. Your point? Useerup (talk) 00:10, 8 November 2011 (UTC)[reply]
My point is that I've produced it from current median data in the article. Though data doesn't sum up to 100%, it does sum up to something, and the pie chart shows the shares in this total. — Dmitrij D. Czarkoff (talk) 00:50, 8 November 2011 (UTC)[reply]
A few questions:
  1. What title would you use for the graph?
  2. What does the entire disk represent?
  3. How would you label each slice if you were to show both the observation and the slice percentage size of the disk (as is customary for pie charts)?
Useerup (talk) 16:04, 8 November 2011 (UTC)[reply]
Here we go:
  1. "Operating system usage share (Median)".
  2. see above.
  3. Name and percentage as stated in table.
Dmitrij D. Czarkoff (talk) 16:40, 8 November 2011 (UTC)[reply]
The point I made ages ago in the discussion I linked above is that, if the data in a pie chart are percentages but they don't add up to 100%, then we can get anomalies like 48% looking like more than half, or 55% looking like less than half. No percentage is accurately represented by the angle used to represent it. This is misleading. One answer would be, if the total adds up to less than 100%, simply to have an unaccounted for gap in the pie (you cannot label it 'other', as there is likely already a median 'other' segment present). I am not sure what you might do if the medians add up to more than 100%. This is why we dropped pie charts of medians, as far as I remember. --Nigelj (talk) 18:41, 8 November 2011 (UTC)[reply]
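The anomaly described above can be checked with a small sketch. The figures are invented for illustration, and `slice_fraction` is a hypothetical helper, not anything used to produce the article's charts.

```python
def slice_fraction(value, all_values):
    """Fraction of the pie's area a value receives when the chart is
    drawn from raw numbers that are not rescaled to sum to 100."""
    return value / sum(all_values)


# Invented "medians" that total less than 100 (48 + 30 + 16 = 94).
under = [48, 30, 16]
# Invented "medians" that total more than 100 (55 + 40 + 20 = 115).
over = [55, 40, 20]

frac_48 = slice_fraction(48, under)
frac_55 = slice_fraction(55, over)

print(f"48% drawn as {frac_48:.1%} of the disk ({frac_48 * 360:.0f} degrees)")
print(f"55% drawn as {frac_55:.1%} of the disk ({frac_55 * 360:.0f} degrees)")
```

In the first case a value labelled 48% occupies more than half of the disk; in the second a value labelled 55% occupies less than half.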


Wrong. You are dodging the questions but are nevertheless proving my point. Calling the graph Operating system usage share would be dishonest and inaccurate:
  1. The proper title would be Operating system medians shares* share of sum of medians for all operating systems. with the caveat that *) For windows the usage share medians for individual versions are used and that the sum of those medians does not add up to the median of Windows overall. Yes, it is that bad. The graph does not depict operating system usage share.
  2. The entire disk represents the sum of all medians, not usage of operating systems. Same caveat as above.
  3. These would be the proper labels:
Windows 7: 33.0% (34.9%)
Windows Vista: 10.8% (11.4%)
Windows XP: 34.4% (36.5%)
Mac OS X: 8.2% (8.6%)
iOS: 3.7% (3.9%)
GNU/Linux 1.2% (1.2%)
Android 1.4% (1.4%)
Symbian 0.2% (0.2%)
Blackberry 0.4% (0.4%)
Other 1.2% (1.3%)
Note that the first percentages are the medians, while the percentages in parentheses are each median's share of the total of the medians. The total of the medians comes to 94.4%. That is why each median's share of the total is higher than the median itself. If you had chosen to show one slice for Windows (like for the other OSes) instead of one for each version the median of Windows would have been (according to table) 79.8%. Summing the medians for the Windows versions only yields 78.2%. That difference would mean that the slice sizes would change yet again. Get that? The total and the size of the other slices change depending on whether you break Windows up or not! But then again, we can also just pretend that there is no problem by not including the labels.
This is so amateurish that I cannot believe I have to explain these things. The usage of median as well as the chart is so utterly wrong. And I have to argue against someone who thinks that if one can create an SVG that is proof positive that median is appropriate. I am going to call in an expert in statistics. Useerup (talk) 18:51, 8 November 2011 (UTC)[reply]
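The relabelling arithmetic in the list above can be sketched as follows. It uses the one-decimal medians exactly as quoted in that list; the quoted total of 94.4% and the parenthesised shares were presumably computed from unrounded table values, so the results here can differ in the last digit.

```python
# Medians as quoted in the list above (percent, rounded to one decimal).
medians = {
    "Windows 7": 33.0,
    "Windows Vista": 10.8,
    "Windows XP": 34.4,
    "Mac OS X": 8.2,
    "iOS": 3.7,
    "GNU/Linux": 1.2,
    "Android": 1.4,
    "Symbian": 0.2,
    "Blackberry": 0.4,
    "Other": 1.2,
}

# The whole disk of the pie chart represents this sum, not 100%.
total = sum(medians.values())

# What each slice actually occupies: the median's share of the sum of medians.
slice_share = {name: 100 * value / total for name, value in medians.items()}

# Sum of the per-version Windows medians, to compare with the separate
# "Windows all versions" median of 79.8% cited from the table.
windows_sum = sum(v for name, v in medians.items() if name.startswith("Windows"))

print(f"sum of medians: {total:.1f}%")
for name, value in medians.items():
    print(f"{name}: {value:.1f}% ({slice_share[name]:.1f}%)")
print(f"sum of Windows version medians: {windows_sum:.1f}%")
```

Because the total is below 100, every slice share comes out larger than the median it is labelled with, which is the discrepancy the two columns of percentages are meant to expose.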
The reason there is an argument about it is because it was set up by an editor on Wikipedia rather than being something 'out there'. It is amateurish and wrong but it is the sort of thing some people in reliable sources sometimes do and it gives a general feel. That does not mean we should do it. And we should not do it. It is what Wikipedia calls original research. Dmcq (talk) 19:13, 8 November 2011 (UTC)[reply]
(e/c) I think I finally see what you're trying to say (Useerup), and it is almost exactly the same as I was saying except I was referring the matter to the pie chart. I can't parse the title "Operating system medians shares* share of sum of medians for all operating systems" at all. Have you left out an apostrophe, or accidentally used the wrong word somewhere in it? The point that makes our medians valid is that it is true to say, "The median of our selected eight estimates of the usage share of Android is 1.39%". We clearly show the eight we chose, and give reasons why we chose them. The calculation is trivial per CALC. Where it becomes tricky is to take a row of medians and try to do something else with them as if they were another row of data. They are not. They are a set of discrete facts, not members of a row that adds up to 100%, or meets any other criteria as a set. That is why you cannot display them in a pie chart. You certainly can't start calculating the percentages in parentheses above, 'each median's share of total medians'. That is effectively what the pie chart does, and both are wrong. It makes little emotional difference with the figures we have for most OSs at the moment, but at the time when the usage share of the MSIE browser was dropping through 50% it might have seemed very important to some people to keep 48% looking greater than half... Your parenthesised figures might have allowed that, and that would have been wrong. There is nothing wrong with each median taken individually, though, and quite a lot that is useful and right. --Nigelj (talk) 19:25, 8 November 2011 (UTC)[reply]
So, if the sum of "median usage shares" comes to more than 100% (as it does at the browser share article and which could very well happen here) don't you think that there's something wrong with the share? Yet, the medians are included side-by-side, encouraging comparison of numbers which clearly are not comparable and even plotted in a graph, further encouraging comparison of incomparable numbers. I would say that the fact that you can arrive at a total share above 100% is a pretty damaging impeachment against any such calculation. This is what happens when statistics is being applied without proper knowledge of the field. Useerup (talk) 20:53, 9 November 2011 (UTC)[reply]
Maybe You would actually state in which way it is amateurish? To date You just make statements. What about actually proving them? — Dmitrij D. Czarkoff (talk) 20:12, 8 November 2011 (UTC)[reply]
It seems You don't have even the most general knowledge about statistics. First, let's pass to questions:
  1. To prove Your position You truncated my answer. Impolite and shows You can't find a proper grounding for Your position. The Median is a valid and widely used statistic method, so the heading "Operating system usage share (Median)" is absolutely OK, as well as the heading "Usage share of web client operating systems. (Source: Median values from Usage share of operating systems for August 2011.)". The heading "Operating system usage share" is not OK, and that is the reason You can't find it on the article page.
  2. The entire disk represents the total of the median values, which is exactly what the disk of the "Operating system usage share (Median)" pie chart is supposed to represent.
  3. The labels You propose would not be the proper labels as the pie chart represents the median values, which are not supposed to sum up as 100% per se.
Do You really understand what the difference between mean and median values is? Do You understand why the median is preferred for representing the overall result of similar statistical research on different user bases?
P.S.: Please, stop screwing indentation! This is not the first time Your reply shifts right in the middle.
Dmitrij D. Czarkoff (talk) 20:12, 8 November 2011 (UTC)[reply]
Oh, now, after a week of rioting here You actually state the lack of expertise. Maybe just shut the whole thing up? You are clearly in the minority, and there is evidence of the consensus on the question before. — Dmitrij D. Czarkoff (talk) 20:45, 8 November 2011 (UTC)[reply]
I added the tag because I hope that an expert can explain this blatantly obvious misuse of statistics better than I can. Useerup (talk) 20:53, 8 November 2011 (UTC)[reply]
Well one of my degrees is a masters in statistics but I do not believe the main problem here would be solved by correct use of statistics. It is a basic failure to follow Wikipedia policy. We should not be calculating figures like this, it is against WP:CALC. If it requires special expertise to know what's right and an expert to explain, it's just not a routine calculation. You could easily put all the figures into one chart using different colour lines and that would require zero thought to understand and explain. Just get rid of the unnecessary calculation. Dmcq (talk) 00:21, 9 November 2011 (UTC)[reply]
Why do You think that median calculation requires any skills or any expert? What can You cite in favour of Your opinion? Please stop just complaining and pass to actually proving Your complaints. The only argument so far was that values don't sum up as 100%. The next obviously needed step for this discussion to ever turn into real discussion is proving that there is at least something wrong with it. And to actually make it possible to somehow resolve the issue You have to provide a vision on how to summarise the referenced data. — Dmitrij D. Czarkoff (talk) 01:24, 9 November 2011 (UTC)[reply]
About expertise as the plumber said knowing where to hit the pipe is what you're paying for. Yet another of Wikipedia's policies besides WP:CALC is WP:BURDEN. We don't need sources to remove stuff. We need sources to justify keeping stuff in. We should not be proving anything. And we do not need to prove anything. Even if there was a burning need to make a simple graphic I've pointed out how you can do that without causing problems. Just get rid of the median line in the table and that form of the graphic. Dmcq (talk) 08:45, 9 November 2011 (UTC)[reply]
WP:BURDEN has no relation to this discussion: the median is referenced with its data sources, which are probably the best referenced material in Wikipedia. And these references are out of scope of this discussion. If You have a problem with the references, please start a new thread.
Here we discuss the current representation of data, so please, actually be exact and answer the questions in my previous reply. Or just say You can't. — Dmitrij D. Czarkoff (talk) 11:17, 9 November 2011 (UTC)[reply]
Your stuff simply is not in line with policy. But okay I'll answer. There are multiple representations. For instance many people use geometric mean to stick a whole lot of disparate data of this type together. Then there's other ways which take more account of that they are all from slices of a whole. Then there's all the other problems like compatibility and weighting by reliability. The figure produced is in essence a judgement, in other words dependent on a point of view. What entitles anybody here to make such judgements? This is the sort of stuff the original research policy was designed to combat, people using their own judgements to stick their own ideas into articles. WP:CALC is deliberately minimal and this just drives a coach and horses through it. If a reliable source does this sort of thing we can report it and I would not normally feel too worried about the problems but we should not do it ourselves. I really do not need to explain this to you, it is straightforwardly against policy. Dmcq (talk) 12:11, 9 November 2011 (UTC)[reply]
So Your point is that if there are many methods out there, wikipedian can't use any of them. That effectively means that wikipedians can't add any data, as there are always alternative approaches. Your reading of the policy is too restrictive and is itself a mere POV. — Dmitrij D. Czarkoff (talk) 15:25, 9 November 2011 (UTC)[reply]
Even if there was only a single way it would still not be just a routine calculation like changing kilometres to miles. The only way I could see for justifying the graphic is under WP:OI as an illustration rather than an accurate sourced fact, but the median line would have to be removed from the table or put somewhere where it clearly was just for illustrative purposes and not something justified by the sources. I'd just remove the median line in the table as after all it is easily calculated. Put in the table it is an editors own analysis, and one's own analysis is original research on Wikipedia. Just because a number is easily calculated does not mean it is a routine calculation. Dmcq (talk) 16:04, 9 November 2011 (UTC)[reply]
Even without a degree in statistics it should be obvious that how to apply statistical analysis (such as a median calculation) requires deliberation, not only on whether the data is safe for such an analysis, but also on whether the result is at all meaningful. This is one of the situations where not only is the median meaningless, it is also original research. As Dmcq says: It is not a routine calculation. This very article demonstrates why. Sure, a median is calculated, but then it is immediately compared to other medians in the table, and plotted in a graph. Comparing medians is meaningless. Readers will believe the graph shows usage shares compared, while in reality it shows mangled data. Useerup (talk) 16:32, 9 November 2011 (UTC)[reply]
OK, as You don't say anything new again, I assume You have nothing more to say. So, I quit this discussion for now as I don't see neither valid points on your side nor significant amount of Your supporters to be able to call Your position WP:CONSENSUS. — Dmitrij D. Czarkoff (talk) 17:03, 9 November 2011 (UTC)[reply]
I see no consensus here for keeping the current median line in the table. I think I can agree with keeping the graphic under WP:IMAGES#Pertinence and encyclopedic nature "Consequently, images should look like what they are meant to illustrate, even if they are not provably authentic images". I think it scrapes in under WP:OI "Original images created by a Wikipedian are not considered original research, so long as they do not illustrate or introduce unpublished ideas or arguments". The median line in the table however does not satisfy WP:CALC "This policy allows routine mathematical calculations, such as adding numbers, converting units, or calculating a person's age, provided there is consensus among editors that the arithmetic and its application correctly reflect the sources". That is clearly for things like converting miles to kilometres. It wouldn't be routine even if a consensus here thought it was. Dmcq (talk) 18:10, 9 November 2011 (UTC)[reply]
Agree with Dmcq. Median is not supported by a source and can be removed on this basis alone. Other issues remain as well (improper synthesis, disparate sources, sources selected by WP editors). Useerup (talk) 18:27, 9 November 2011 (UTC)[reply]
Please refrain from wikilawyering: changes need WP:CONSENSUS, so it's Your task to convince everyone. BTW, someone here has mentioned the previous consensus which has resulted in the current median line. — Dmitrij D. Czarkoff (talk) 20:07, 9 November 2011 (UTC)[reply]
A change does not need consensus when the change is to remove original research. Don't game the system. WP:NOR is one of 3 core content policies and generally trumps guidelines. To outline this pretty clear policy:
  • The term "original research" (OR) is used on Wikipedia to refer to material—such as facts, allegations, and ideas—for which no reliable, published source exists. This includes any analysis or synthesis of published material that serves to advance a position not advanced by the sources . (my emphasis of the parts relevant to this debate):
Median is synthesis of published material. You can try to call it "summarize" or use other words for it. But it remains synthesis: A new "fact" is derived from the sources but not attributable any of them.
  • To demonstrate that you are not adding OR, you must be able to cite reliable, published sources that are directly related to the topic of the article, and directly support the material as presented . (my emphasis of the parts relevant to this debate):
None of the sources directly (not even indirectly) supports usage share as presented through a median. In fact, they couldn't, could they? It is a synthesis of multiple sources.
There may not be much point in debating this further. The median is WP:OR and is going to be deleted per the core WP policy. If you cannot accept this you can try to raise it on a noticeboard. Useerup (talk) 21:19, 9 November 2011 (UTC)[reply]

General responses

None of the eight sources is based on web developers' usage share. That source (w3schools) is not used in this table.
I think I've spent enough time on this dispute now. This table has been around for a long time and is the product of consensus here. The sudden arrival of one new editor is not sufficient to disrupt that, no matter how aggressive, disgruntled and sure of his opinions he is. So until Useerup's view gains more support, I assume he's on his own and will spend my time doing something more interesting than repeating arguments which have now had a good airing on both sides.--Harumphy (talk) 11:34, 31 October 2011 (UTC)[reply]
I actually support his view. The median is unimportant, as it often represents outdated data, you know what I'm saying here? Jasper Deng (talk) 21:48, 31 October 2011 (UTC)[reply]

What now?

After this summary... what are the next steps? It was proposed previously in the discussion that we should vote (again) about having or not the median line there. What are we waiting for, exactly? -- 89.180.146.171 (talk) 20:05, 1 November 2011 (UTC)[reply]

In my initial attempts at discussing these topics only a few of my points were addressed. But the discussion was quickly going in the direction of whether the median line constituted a "conclusion" or not. My concern was that we were splitting words while the core issues were being ignored. That is why I broke them out. I'm waiting for someone to discuss these issues. My concern is that the median line is wrong (median is the wrong function to apply here for multiple reasons) and it is in contradiction to one of the WP pillars which says no original research. I am willing to have this discussion. For the record, I do not consider a vote appropriate on these issues, per WP:NOTDEMOCRACY. Useerup (talk) 22:45, 1 November 2011 (UTC)[reply]
Just open a request for comments, since there's no consensus over this issue. WP:NOTDEMOCRACY does not apply here since it concerns with using voting as primary means to reach consensus. Here we use arguments and that's accepted. 1exec1 (talk) 21:58, 5 November 2011 (UTC)[reply]
Thanks for the suggestion. I opened an RFC. Useerup (talk) 12:21, 6 November 2011 (UTC)[reply]

Rfc discussion

I've had another look and I think it can be counted as an illustration rather than content which lessens the requirements. It should however be in or just beside the section with the source data that it is illustrating rather than at the top and it would be far better if it also illustrated the minimum and maximum lines for each. Dmcq (talk) 18:15, 7 November 2011 (UTC)[reply]
The illustration was actually a secondary issue. The primary concern was about the median line in the table, not the illustration. Hence such a debate. 1exec1 (talk) 18:51, 7 November 2011 (UTC)[reply]
I'd missed that. Sticking median into the main text without a special marker saying it is just something we thought of doing is wrong. We really do need to distinguish very carefully between illustrations and examples and the main text. I think the median line should be removed from the table. Dmcq (talk) 19:17, 7 November 2011 (UTC)[reply]
What kind of marker are You talking about? The row is specifically called Median, which is a bold marker. If You don't like medians in general, please explain why. Until You do so, You don't participate in discussion — You just pollute the discussion with baseless statements. — Dmitrij D. Czarkoff (talk) 20:35, 8 November 2011 (UTC)[reply]
  • Simple descriptions of mathematical information are always permissible. There are some issues above about whether the underlying data is appropriate, but saying that "this is the median for our dataset" is no more prohibited than saying "The median height of US Presidents is a bit less than six feet tall". Encyclopedias are supposed to summarize information (including mathematical information), not just dump raw data on the readers and tell them that if they want to know more, they should buy their own stats software. WhatamIdoing (talk) 15:42, 7 November 2011 (UTC)[reply]
agree. Reasoned in discussion threads. Dmitrij D. Czarkoff (talk) 19:23, 7 November 2011 (UTC)[reply]
Graphic not requiring medians, in fact requiring minimal thought to produce
We should not be reporting median heights for presidents either unless a source reports it. We can represent the data without this messing around by having something like different coloured lines for each source in a single table. It would take no more room than the current table, it would require no explanation, it wouldn't cause a dispute, and it would show the variation in the sources which this doesn't. As to presidents we should not even have a table of their heights in the first place unless a source starts listing them out. Just because we can do something doesn't mean we should. Dmcq (talk) 12:20, 9 November 2011 (UTC)[reply]
If the subject of the article or section is about the physical characteristics of US Presidents, then we most certainly should—and we should do it by saying that the median height is a bit less than six feet, or by saying that they range from 163 cm to 193 cm, not by typing the eye-glazing details, e.g., "US Presidents have had the heights of 193 cm, 193 cm, 189 cm, 188 cm, 188 cm, 188 cm, 187 cm, 185 cm, 185 cm, 185 cm, 183 cm, 183 cm, 183 cm, 183 cm, 183 cm, 183 cm, 183 cm, 183 cm, 182 cm, 182 cm, 182 cm, 182 cm, 180 cm, 180 cm, 179 cm, 178 cm, 178 cm, 178 cm, 178 cm, 177 cm, 175 cm, 175 cm, 174 cm, 173 cm, 173 cm, 173 cm, 173 cm, 171 cm, 170 cm, 170 cm, 168 cm, 168 cm, and 163 cm", even if the particular source we are using happens to provide the full data. WhatamIdoing (talk) 23:42, 9 November 2011 (UTC)[reply]
If a source lists the heights but does not give the average then the average is not something that has shown itself to have WP:DUE interest. Producing figures like that just because we think they are interesting or should be done is original research. Dmcq (talk) 23:57, 9 November 2011 (UTC)[reply]
Here for example is a graphic which required no original thought to produce from the data. I just copied it over to Excel and selected the very first graphic type with no options. As you can see it requires no decisions. I'm sure something a bit prettier could be done but as you can see this gets the data over okay. Dmcq (talk) 13:41, 9 November 2011 (UTC)[reply]
Your graphic represents why there must be a median. On its own your graphic may be only of some use in an article about the relation between statistical research and reality. This article is no place for such. — Dmitrij D. Czarkoff (talk) 15:19, 9 November 2011 (UTC)[reply]
It could have gone in straight per policy but you're saying you just don't like it. Life isn't always simple. I've said above how WP:OI might be justified for the original image but the median line should be removed from the table whatever way it is done. Dmcq (talk) 16:10, 9 November 2011 (UTC)[reply]

Is there a consensus to include the median line?

Support

  1. Support use of median in the table, as some form of summary is absolutely needed in situation of considerable amount of RAW statistic data. As median obviously does not violate WP:CALC but instead is absolutely required as per WP:NOTSTATSBOOK, WP:LINKFARM, WP:NOTDIR and WP:NOT PAPERS, it must be present. The title of the page is Usage share of operating systems, so the user comes here to get information about the usage share, not for the means of calculation of the usage share and the organisations that work in this area (though this information should also be present). Leaving the user with numerous reprints of the statistic data from the bunch of researchers is not enough. — Dmitrij D. Czarkoff (talk) 23:01, 9 November 2011 (UTC)[reply]
  2. Support, but with an explanatory footnote so that readers (and future editors) know where these numbers came from. WhatamIdoing (talk) 23:50, 9 November 2011 (UTC)[reply]
  3. Support. I'm still under the belief that the median doesn't violate wikipedia's original research guideline. The reason for this is that the median is clearly stated as a "median" and we are not claiming to be an original source. Likewise, the graph clearly states where the data comes from and that it is not meant to be an original source. Could there be a use of the median which does OR, yes but I don't think that's happening here. Instead, it adds a concise summary to the table, similar to an intro paragraph, but for data. Furthermore, it allows for the creation of the pie chart that people are still interested in, without which, a single "biased" source would have to be used to represent this page. Showing all sources next to each other in a bar graph is hard to read because of the clutter. Also, because the sources are fairly close to each other, visually you will estimate the median anyway to look past the clutter. Clearly, the use of the median has more benefits than any negatives or inconsistencies that may exist in keeping it. That's why I continue to support the use of the median for this page. Jdm64 (talk) 00:09, 10 November 2011 (UTC)[reply]
  4. Support. The use of median provides a useful summary of the data, the data is fully sourced, and the calculations are explained in the footnotes to ensure verifiability. Without the median, the table just becomes a sea of raw numbers in search of a purpose. An encyclopedia exists to provide a summarised introduction to a subject for lay readers. It does not exist to appease the purist obsessions of academics.--Harumphy (talk) 08:19, 10 November 2011 (UTC)[reply]
  5. Support. That table is not very helpful if there's no reasonable summary (arguments per Jdm64). I agree with the argument that the desktop/mobile split is WP:OR and should be removed, but this is a different issue. If this is done, median or mean to summarise a table is definitely not WP:OR. 1exec1 (talk) 09:25, 10 November 2011 (UTC)[reply]
  6. Support It may not be mathematically perfect to use the median - but I don't see how anyone could be fooled into thinking these figures are precise in the first place. I think a median is a sensible way to present the data. If we use a mean we are far worse off, if we don't provide a summary then it makes the article harder to read. Also I would like to express my astonishment that people are making a fuss about using this - it seems totally innocuous to include a median. If wikipedia policy forbids it, wikipedia policy should be changed. Dilaudid (talk) 09:39, 10 November 2011 (UTC)[reply]
  7. Support. The median shows a central tendency. Yes, it is far from perfect. The point is: Is it helpful to the readers? Does anyone disagree that it shows a central tendency? In summary, I see the argument as being practical versus being overly caught up in rules and not using common sense. Daniel.Cardenas (talk) 04:22, 11 November 2011 (UTC)[reply]

Oppose

  1. Oppose use of median line in the table, fails WP:CALC and it is not an illustration or example. I believe however the graphic is okay as a reasonable illustration of the figures as per WP:OI if we make it obvious it is a summary generated by some Wikipedia editors, instead of "Usage share of web client operating systems. (Source: Median values from Usage share of operating systems for August 2011.)" perhaps something like "Summary of usage share figures using the median of the sources in the table". Dmcq (talk) 21:31, 9 November 2011 (UTC)[reply]
  2. Oppose because the median is unimportant and not a good representation of the data.Jasper Deng (talk) 23:02, 9 November 2011 (UTC)[reply]
  3. Oppose. I do find medians very useful, and I think they improve the article, but that doesn't change the fact that they are WP:OR. Allowing them would be stepping onto a slippery slope. In fact, after we first allowed them, the next change was to use the desktop/mobile percentage from one source to compute the desktop share of another source, which is definitely WP:OR. Wikiolap (talk) 05:23, 10 November 2011 (UTC)[reply]
  4. Oppose. Stats, including medians, are based on sample sets, and sample sets are derived from conscious choices. The median measure is used to estimate a central tendency. Both sample sets and central tendency are by definition terms relevant only to the forming of conclusions and predictions about some "real" as approximated by some "sample". So despite the deceptively simple arithmetic, the whole point of a median is to help derive conclusions and predictions, i.e. "original research", about something that can't otherwise be pinned down. What's to be in the sample set is a choice. What measurements to factor in and where to get them are choices. If calcs, such as the median, are disputed, then a good rule of thumb would be that they're not "routine" calculations. As in, a "median" of cell phone numbers would be both simple and arithmetically valid but not routine. Professor marginalia (talk) 05:39, 10 November 2011 (UTC)[reply]
  5. Oppose use of median line. It is not a routine calculation anywhere near the examples of WP:CALC; it is calculated across multiple sources selected by WP editors. The WP:OR is compounded by the fact that WP editors have felt it necessary to "correct" (WP:SYN) the numbers of some of the sources before the calculation. Furthermore, it is not clear what the median results express: they do not express shares of anything, as the total of those shares may exceed or fall below 100%, and the sources represent different demographics and geographic/cultural usage patterns which have not been shown to be either complete or representative for a "summary". If the article after deletion falls into WP:STAT it should be AfD'ed. However, I believe the topic is notable and that the sources are ok. I prefer a fix where sources are still set up in a table with proper notes as to possible bias. Another graph could be produced: horizontal stacked bars, each 100%, with each slice an OS or version and each bar a source. Useerup (talk) 06:52, 10 November 2011 (UTC)[reply]
  6. Oppose. Of course we should not calculate a median of figures that were collected in different ways. Itsmejudith (talk) 08:07, 10 November 2011 (UTC)[reply]
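
For readers following the arithmetic argued over in both lists, here is a minimal sketch of the per-OS median across sources, along with the point raised in the oppose list that such medians need not total 100%. The source names and figures below are hypothetical, made up purely for illustration; they are not the article's data, and this is not necessarily how the table's median line was actually computed.

```python
from statistics import median

# HYPOTHETICAL usage-share figures in percent, invented for illustration.
# They are NOT taken from the sources tracked in the article.
sources = {
    "Source A": {"Windows": 80.0, "Mac OS X": 12.0, "Linux": 1.0, "Other": 7.0},
    "Source B": {"Windows": 90.0, "Mac OS X": 5.0, "Linux": 2.0, "Other": 3.0},
    "Source C": {"Windows": 85.0, "Mac OS X": 6.0, "Linux": 4.0, "Other": 5.0},
}


def median_by_os(table):
    """Return the median share of each OS, taken across all sources."""
    systems = next(iter(table.values())).keys()
    return {
        os_name: median(row[os_name] for row in table.values())
        for os_name in systems
    }


medians = median_by_os(sources)
total = sum(medians.values())

for os_name, value in medians.items():
    print(f"{os_name}: {value}")
# Each hypothetical source sums to 100, but the per-OS medians do not have to.
print(f"Sum of medians: {total}")
```

With these made-up numbers every source row sums to 100, while the medians sum to 98, which is the effect described in oppose comment 5. Whether that makes the summary unusable or merely approximate is the question under discussion.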

Conclusion: No consensus

Only routine calculations are allowed per WP:CALC, and only if there is consensus among editors that the arithmetic and its application correctly reflect the sources. Such consensus does not exist. I am deleting the median and removing the expert request. I'll leave the graph in for now, pending suggestions on how to visualize data without performing synthesis. Useerup (talk) 18:41, 10 November 2011 (UTC)[reply]

You can't make such statements until the discussion ends. It hasn't. Furthermore, there is no consensus on leaving statistical data without a summary. — Dmitrij D. Czarkoff (talk) 18:48, 10 November 2011 (UTC)[reply]
Yes I can: WP:OR. Even if you consider it "routine calculation" WP:CALC such calculations are only acceptable when there is consensus among editors that the arithmetic and its application correctly reflect the sources. Per the policy it is WP:OR, and original research must be removed. The WP:BURDEN of evidence lies with the editor who adds or restores material (i.e. you). There is no evidence of such consensus, rather there is evidence that such consensus does not exist. Since you cannot demonstrate consensus that median is a routine calculation it is gone. So please revert your last edit which restored the median. Useerup (talk) 19:01, 10 November 2011 (UTC)[reply]
Again, You didn't prove the absence of consensus yet. I will remove this line on my own, but only after the relevant sections here become less active. Now the discussion is ongoing, so there is no proof of Your claims that consensus is not reached. — Dmitrij D. Czarkoff (talk) 19:05, 10 November 2011 (UTC)[reply]
I do not have to prove anything. It is you who are restoring material, the WP:BURDEN is on you. The support/oppose sections above do not support that there is consensus that median is allowable under WP:CALC. Therefore, it is not. I say again: Please remove the median, your revert is in direct contradiction to WP policies. Useerup (talk) 19:16, 10 November 2011 (UTC)[reply]
The process of reaching consensus is out of scope of WP:BURDEN. Your edit fails both WP:ADMINSHOP and WP:DRNC, and I reverted it as such. The problems are not solved yet, I see no obligation to undo my revert. — Dmitrij D. Czarkoff (talk) 19:29, 10 November 2011 (UTC)[reply]
Please see WP:RFC#Ending RfCs. RfC lasts until consensus has been reached. Currently there is no consensus as you admitted, so you can't close the discussion. Also, per the same policy: "[Closure of a RfC] should be done by an uninvolved editor". 1exec1 (talk) 19:27, 10 November 2011 (UTC)[reply]
Lack of consensus has clearly been demonstrated above. "Routine calculation" was the only way to avoid the median calculation being classified as WP:OR. But I can wait. Useerup (talk) 19:44, 10 November 2011 (UTC)[reply]
The first comment in the thread about consensus was posted 2011-11-09 20:31 (UTC). Your edit happened 2011-11-10 18:43 (UTC). No fact can be established in 22 hours on the internet. That said, the common duration of such a procedure is 30 days. — Dmitrij D. Czarkoff (talk) 19:49, 10 November 2011 (UTC)[reply]

I think we should ask for outside advice through the use of Wikipedia:Dispute resolution noticeboard. I've already typed a request up, but haven't submitted it yet. Should I submit it? Jdm64 (talk) 20:18, 10 November 2011 (UTC)[reply]

Will we need to establish a consensus on this one also? ;) Jokes aside, I think that's a good idea - it would certainly be beneficial to see some help of uninvolved people that could guide the discussion. 1exec1 (talk) 20:38, 10 November 2011 (UTC)[reply]
Do you have any reason to believe that will lead to another conclusion? IMO this is mainly policy discussion and it is hard to see how it will lead to another result. But if you believe so, then go ahead.Useerup (talk) 20:32, 10 November 2011 (UTC)[reply]
You should, as some of us are a step away from something very wrong. — Dmitrij D. Czarkoff (talk) 20:34, 10 November 2011 (UTC)[reply]

I oppose the inclusion of that median graph. I especially oppose its current position, at the top of the page. IMHO, this MISLEADS a casual reader who comes to this page. It gives an "authoritative" impression, that the percentages shown in the graph are a "good" approximation to OS market share. That is, the reader might miss the distinction between measuring "web client" and measuring the actual installed base. (Which there is no known way to do.) I do favor providing the user with a line at the bottom of the web clients table, which provides that median information. IMHO, that gives a better "context" to the information. Can such a line be automatically calculated from the contents of the table, so that when any editor updates the table, the calculation is updated? ToolmakerSteve (talk) 03:07, 11 November 2011 (UTC)[reply]

I consider the median to be a justifiable calculation because it is (a) straightforward, (b) intuitive, and (c) easily verified by any reader. I disagree with the assertion that the "selection of sources" makes it OR. ESPECIALLY if it is merely a summary line in a table. What I oppose, is placing that information in such a way that it is given more prominence than is justified; e.g. don't make it appear to summarize the entire article, don't make it appear to answer the user's question authoritatively. There is no authoritative answer; this needs to be clear. ToolmakerSteve (talk) 03:14, 11 November 2011 (UTC)[reply]
A bit more about the topic of being OR. As long as it is presented as a mere "calculation aid", then it makes sense to me. A knowledgeable reader, faced with the table, would want to know the average. An unknowledgeable reader, would want to know what they should do with the information, to make sense out of it; if the "knowledgeable" answer would be "calculate the median", then do so, for them. Presumably a statistician would also calculate an "error estimate", but I hesitate to suggest doing so, because that risks being misleading. Specifically, it is only an estimate of error based on the numerical differences between the sources -- it assumes the sources are "free of errors". A distinction that would be lost on 99.9% of readers. Better to "do the simplest thing". ToolmakerSteve (talk) 03:32, 11 November 2011 (UTC)[reply]
A statistician would be more likely to produce something like what is in the article Summary statistic. For disparate data it is just a bit of a finger in the air. For this sort of stuff it wouldn't matter much outside Wikipedia, but we need to follow our standards or figure out new standards. Dmcq (talk) 12:02, 11 November 2011 (UTC)[reply]
OK, I just read the rest of the discussion. Did not realize that the median was originally added as a line in the table, and that this was considered MORE controversial than the current graph. IMHO, the reasoning here is backwards. The graphic risks being noticed/perused/referenced/copied on its own; a median line in the table is "safer", it would tend to be examined with the table, not referenced on its own. I agree that such a median line would need to be clearly distinguished from raw source data. IF it is so distinguished, I completely fail to understand what is controversial about it. However, I agree that UNTIL/UNLESS consensus is achieved, it should be omitted. ToolmakerSteve (talk) 04:15, 11 November 2011 (UTC)[reply]

Ask yourself: what is Wikipedia's goal? To provide reliable information. Is presenting a central tendency of disparate data helpful to the readers? The alternative is to present a bunch of numbers that few want to look at, or to fight with editors over which one represents the "best" numbers. The central tendency from the table, also known as a "summary", is used in quite a few other pages. It is a compromise between options that are not ideal. The compromise helps people understand the "usage share of web browsers". Daniel.Cardenas (talk) 05:24, 11 November 2011 (UTC)[reply]

Changing subjects somewhat. This isn't about OS market share. This is about which web browsers are used most to browse the internet. Daniel.Cardenas (talk) 05:24, 11 November 2011 (UTC)[reply]
Agree but that should be a separate section from this. Best to separate problems. That's why they have 'web client' operating systems further down but yes the title does imply something different. Dmcq (talk) 11:53, 11 November 2011 (UTC)[reply]

Windows 7 is now the Most Widely Used Operating System

Many websites and other media report that Windows 7 is now the most widely used operating system and has overtaken Windows XP. I would like to ask other editors to kindly check and refresh the usage share to keep the page updated. Even the Windows XP page says that it is the 'second most popular version of windows', which logically implies that Windows 7 is now on top. Meanwhile the Windows Vista share has also changed, indicating more usage for Windows 7. Changing both the text and the graph would be best. Thanks TheGeneralUser (talk) 16:42, 30 October 2011 (UTC)[reply]

Six of the eight sources that we track in this article still have XP as greater than 7, but on recent trends that's likely to change in about a couple of months or so. The Windows XP page cites w3schools as its source, a source which monitors web usage by web developers only. They are a small and highly atypical subset of the total user population so for our purpose it's not a credible source.--Harumphy (talk) 23:13, 30 October 2011 (UTC)[reply]

Looks like Win 7 passed XP in US, but has not yet done so globally. ToolmakerSteve (talk) 03:45, 11 November 2011 (UTC)[reply]

Monthly update

Usually I do a major update to the web client stats on the first of the month, as that's when four of the eight sources update. In view of what's happened here recently I can no longer be bothered.--Harumphy (talk) 08:44, 1 November 2011 (UTC)[reply]

While I understand your point, and of course you're free to contribute or not, I am sorry that you took that decision. Yes, there's a debate going on about the "Median" line in the article, but nothing to stop the update of the page (not with new stuff but with update of what's already there) until that debate is over... 89.180.146.171 (talk) 19:59, 1 November 2011 (UTC)[reply]
Harumphy, I personally hope that you will continue to update those stats. IMHO if something is so controversial that it is leading to the level of upset and tit-for-tat that we are seeing .. we should err on the side of "quiet", and omit the median information. Median is not nearly as important as having up-to-date sources! And having multiple contributors! ToolmakerSteve (talk) 03:50, 11 November 2011 (UTC)[reply]

Undue weight

In its introductory sentence the article "Usage share of operating systems" states "Different categories of computers use a wide variety of operating systems, and the usage share varies enormously from one category to another." However, the picture prominently presented in the top right corner refers only to "Usage share of web client operating systems", which covers just a marginal portion of computer systems and thus gives undue weight to the Windows operating system. Even when looking at 32-bit CPUs only, desktop computers account for only 2% of CPUs sold (see microprocessor section "Market statistics"). As the total number of CPUs sold is estimated at 1 billion, this accounts for about 20 million CPUs, but the Top 500 list of supercomputers of June 2011 already counts 7 million (mostly 32-bit desktop) CPUs, of which almost 6.5 million (91%) are driven by Linux, while Windows counts only 63140 CPUs or 1%, and in the performance statistics Windows even falls below the margin. In 2004 there were an estimated 548,380 PCs in use worldwide "Number of PCs by country", ITU. 2004. Thus, with respect to CPU count, Windows market share can be estimated at below 50% when excluding embedded systems and at around 15% when including embedded systems Embedded systems survey --BerlinSight (talk) 14:50, 13 November 2011 (UTC)[reply]

  1. Your statistical analysis is dubious: while desktop CPUs are 2% of total microprocessor sales, the rest of the microprocessors sold include DSPs (you can typically find several of those in a single desktop computer), microprocessors for dumb cell phones, vehicle electronic equipment, home appliances (washing machines, watches, etc.), audio/video players, TV sets and other sorts of equipment that don't typically run operating systems.
  2. Embedded systems typically run custom operating systems using the Linux kernel, which don't actually belong here, as they are mere firmware, just like the firmware in wireless adaptors, RAID controllers, and other computer parts or PC BIOSes;
  3. Desktop systems constitute the point of interest for the majority of readers and editors (as you can see by examining this talk page and its archives);
  4. The image in question is related to a Median line of the table, which is currently disputed in Dispute resolution noticeboard, so right now it may be the wrong time for this discussion.
Dmitrij D. Czarkoff (talk) 15:18, 13 November 2011 (UTC)[reply]
I believe the article reflects the state of research by external sources. There is an abundance of sources analyzing the usage share of operating systems used as web clients, and also the market share of smartphone operating systems; both are well represented in the article. The graph showing web clients is appropriate, as most sources focus on these statistics, and the article reflects that.
It would be appreciated if you would add a new section about embedded operating systems, as it is currently missing from the article. Wikiolap (talk) 17:59, 13 November 2011 (UTC)[reply]