Meaning (interpretation) of Variance

While reading the article on variance, I found that a paragraph on how variance should be interpreted (in an intuitive way) was missing. I don't feel confident enough to write that bit myself, but if someone could add this part it would, I believe, make the article more interesting and complete. -- Preceding unsigned comment added by Marc saint ourens (talk • contribs) 18:29, 1 September 2013 (UTC)

Done. Thanks very much for the suggestion! Duoduoduo (talk) 16:56, 2 September 2013 (UTC)

Variance for six-sided die

Isn't the variance given in the article only correct for an infinite number of rolls of a die? For one roll, the variance is zero. For two rolls, a quick calculation suggests the variance is 5/8. Grover cleveland (talk) 16:27, 4 June 2014 (UTC)

Variance is not dependent on an experiment, but only on properties of the die. You are confusing it with sample variance. Nijdam (talk) 20:06, 4 June 2014 (UTC)
Grover, you are correct. There are actually two distinct concepts, both called "variance". One kind of variance is calculated from samples. The other kind is derived from the equations of a theoretical probability distribution, and represents the variance that would be calculated from an infinite number of samples generated by that distribution. More care should be taken within the community to be clear about whether distribution variance, sample variance, or population variance is meant. The Wikipedia page is an especially important place to be careful, because its purpose is geared more towards learning, and less towards convenience for those already familiar with the concepts. Seanhalle (talk) 17:05, 16 September 2015 (UTC)
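
To make the distinction in this thread concrete, here is a minimal Python sketch. It is not taken from the article, and the function names are made up for illustration. It computes the variance of the fair-die distribution exactly, then averages the sample variance over every equally likely outcome of a fixed number of rolls, once dividing by n (`ddof=0`) and once by n - 1 (`ddof=1`).

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)  # a fair six-sided die, each face with probability 1/6


def distribution_variance(faces):
    """Variance of a uniform distribution over `faces`: E[X^2] - (E[X])^2."""
    n = len(faces)
    mean = Fraction(sum(faces), n)
    mean_sq = Fraction(sum(f * f for f in faces), n)
    return mean_sq - mean ** 2


def sample_variance(xs, ddof):
    """Sum of squared deviations from the sample mean, divided by len(xs) - ddof."""
    m = Fraction(sum(xs), len(xs))
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)


def expected_sample_variance(faces, rolls, ddof):
    """Average of sample_variance over all equally likely outcomes of `rolls` rolls."""
    outcomes = list(product(faces, repeat=rolls))
    return sum(sample_variance(o, ddof) for o in outcomes) / len(outcomes)


print(distribution_variance(FACES))                # 35/12
print(expected_sample_variance(FACES, 1, ddof=0))  # 0
print(expected_sample_variance(FACES, 2, ddof=0))  # 35/24
print(expected_sample_variance(FACES, 2, ddof=1))  # 35/12
```

Under these assumptions the distribution variance is 35/12 regardless of how many rolls are made. A single roll always has sample variance zero, and for two rolls the n - 1 version averages to exactly the distribution variance, while the divide-by-n version averages to half of it.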

Too encyclopedic and mathematical

Normal people come here looking for a simple, applicable definition of variance and are overwhelmed by technical details.

I found this to be more useful to me:

And I was a distinguished science graduate 20 years ago. -- Preceding unsigned comment added by (talk) 03:06, 16 March 2015 (UTC)

I agree, this page is written in a very obtuse way. (talk) 02:23, 12 July 2015 (UTC)
Agreed. is far superior. I wonder if this stems from differences in opinion among editors about the nature and purpose of Wikipedia. Is it a record of knowledge only comprehensible to experts in the field, or is it a resource for learning about fields where the reader is not an expert? Currently, maths articles lean heavily towards being records of knowledge written by experts and comprehensible exclusively by those experts. I actually avoid Wikipedia for maths these days and this page is an example of why. I can't get the simple equation for the unbiased variance out of the reams of esoteric <insert random pejorative noun> written in this article. There is also no article for Sample variance, so I can't search for it unless I write it in a talk comment. Instead it is hidden 3/4 of the way down a very long article (and still inferior to the mathworld version). Doug (talk) 18:56, 13 November 2015 (UTC)
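
For reference, the equation asked for above is the standard unbiased sample variance. The symbols are chosen here for illustration and are not quoted from the article: x_1, ..., x_n are the n observations, \bar{x} is their mean, and s^2 is the estimate.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```

Dividing by n - 1 rather than n is what makes the estimate unbiased for the population variance.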

Misleading text in intro: reality does not contain statistical distributions, they are models created by people and fit to observations

This text of the intro makes incorrect and misleading statements: "The variance is a parameter that describes, in part, either the actual probability distribution of an observed population of numbers, or the theoretical probability distribution of a not-fully-observed population from which a sample of numbers has been drawn. In the latter case, a sample of data from such a distribution can be used to construct an estimate of the variance of the underlying distribution; in the simplest cases this estimate can be the sample variance."

The text implies that an "actual distribution of an observed population" is something that exists in reality. In fact, this notion is a common mistake, driven by the way the human mind is constructed. For some people the world of numbers is real, in some sense, and is commonly conflated with what exists outside of our heads.

What can, though, be said to exist is a population of numbers that are generated according to a probability distribution model. Those numbers, when measured, fit to that same distribution model very well. But that same set of numbers can be fit to any distribution, with varying degrees of goodness of fit. The set of numbers has no inherent "actual distribution". It only has a particular distribution that it happens to fit best to. Even if one takes the case of an endless stream of observations generated from a distribution model, that stream of observations can still be fit to any distribution, it just is best predicted by the same model as the one used to generate the numbers.

This is a general concept that, in my experience in teaching statistics, has led to a high degree of confusion among students. It takes diligence to avoid this common misconception, because the human mind naturally wants to equate the models that our heads have inside them with the reality that is outside of our heads. When taking a step back, it is clear that the human mind has only models inside it, which it fits to observations coming in. When those models fit well, we experience that emotionally as if the model IS the reality outside our heads. We feel that the model we have somehow exists out there, generating the observations. In many cases, no harm is done by believing this.

However, in statistics we reach the point where we are at the boundary point, where we are examining this very bifurcation between models and reality. Allowing this emotional tendency to let us lose sight of the difference between a model and the reality that the model is fit to, I have found to be the cause of confusion for students who try to learn statistics. That conflation messes with their heads.

Therefore, I propose to alter these words to the following:

"The variance is a measure that is calculated from observations, and is independent of any particular probability distribution model. For each kind of probability distribution model, an equation can generally be derived that relates parameters of the model to the variance. If the observed population consistently yields observations that match well to a particular distribution model, then a relatively small sample of observations can be used to construct a good estimate of the variance that would be obtained if all possible samples were taken."

(This carefully worded statement avoids many commonly taken-for-granted assumptions, such as that the generating process is stationary, and it highlights the difference between a model that we fit to observations and the measured reality that is generating the observations. And it still gets the concept across that we need only a few samples to come close to the omniscient value -- in the case that the generator matches the model well.)

If there are no strenuous objections, I will come back in a week or two and make this change. Seanhalle (talk) 05:10, 11 September 2015 (UTC)

The existing text is an attempt to explain the difference between "population variance" and "sample variance". It doesn't do that very well, but it isn't talking about fitting models at all. Your text is not about the same thing and doesn't serve as a replacement. It will also confuse readers: "The variance is a measure that is calculated from observations" is not the whole truth. Both theoretical distributions and finite populations have variances that are independent of observation. Also, the fact that the sample variance is an unbiased estimator of the population variance doesn't depend on the probability model except inasmuch as the population variance is finite. Your words seem to hint instead at the power of tests, but nobody will get it. McKay (talk) 07:52, 11 September 2015 (UTC)
Thank you for the response. I hear what you are saying regarding my proposed alternative; however, the need remains for an improvement over what is there now. I do, though, disagree that theoretical distributions have a variance. That is, if by "theoretical distribution" you are referring to things such as the binomial distribution, Poisson distribution, and so forth. Such things are better viewed as generators of observations. Such generators do not have an inherent variance, but rather have a means to calculate the variance that would be obtained from an infinite set of observations generated by the distribution. They are most commonly used as models, and one compares actual observations to observations that would be generated by the model. If the fit is good, then one can use the equations of the distribution, which tell you characteristics that would be measured on an infinite set of observations generated by that model, and thereby predict that further real world observations will conform to those calculated characteristics.
With this view, such a distribution model has no inherent variance, just a way to predict the variance measured from observations generated from the model. This goes to the heart of the way of thinking that makes statistics difficult for students. Making observations be the fixed point, and treating distributions as generators of observations clears up the confusion and has resulted in dramatic changes in the learning experience of students.
Your point is well taken, though, regarding population variance versus sample variance.
As a compromise, I would be willing to go with the following:
"The variance is a characteristic of a set of observations, which are either measured from a real world system or generated by a theoretical probability distribution or other generating model. In the ideal case, all possible observations of the system would be available, where the variance calculated from this set is called the population variance. In most cases, however, only a finite sample size is available, and the variance calculated from this is called the sample variance and is considered an estimate of the full population variance. Theoretical probability distributions can be viewed as generators of observations, and the variance of an infinite set of observations generated by them can be mathematically determined via an equation. Such generators are used in thought experiments to generate a finite sample size, or real world observations are often fitted to these theoretical probability distributions. In either case, the set of observations is considered a not-fully-observed sampling of the possible population. Because the sample sizes are incomplete, the variance calculated using the samples is an estimate of the variance of the full population; such an estimate can be calculated in several ways, the simplest of which is just the straightforward variance of the sample."

Seanhalle (talk) 09:29, 11 September 2015 (UTC)

I hope you had a good weekend. I like your attempt to distinguish population variance from sample variance. But I can't accept that theoretical distributions don't have variances; that is contrary to the standard definitions of probability theory. A distribution is just a mapping from an event space to real numbers satisfying some axioms, and the variance has a precise mathematical meaning. I know what you mean by "theoretical probability distributions can be viewed as generators of observations", but I'm a mathematician who dabbles in probability. I think most readers won't have a clue what you mean. Actually I think that the relationship between actual populations and abstract models of them belongs in some other article, perhaps statistics, and only adds unnecessary complexity here. McKay (talk) 05:10, 14 September 2015 (UTC)
This is an excellent discussion. Your point is well taken. If this were a forum solely for mathematicians, then it would make sense to leave the wording in a form that is the most comfortable for mathematicians. However, I would suggest that this forum is actually the opposite. I would be surprised if very many mathematicians came to the Wikipedia page in order to learn what variance is. And, if they did happen to do so, I expect that they would immediately see what the suggested wording is aiming at, and would feel uncomfortable with it, but would nonetheless be unfazed, and would move forthwith to the equations and the precision within the body of the article. Hence, I expect that only a small number of visitors will have the value of this article reduced by the wording suggested above for the summary section.
In contrast, the majority of visitors I would expect to be those with a typical background, from across a wide variety of fields. The article should be tailored to provide that audience the maximal benefit. Given that, the wording should help people in that audience the most, even if it does make mathematicians a bit uncomfortable. Hopefully we can avoid a situation where the tyranny of the minority causes harm to the large majority. If the wording conveys concepts that give the reader the ability to successfully work with variance in whatever facet of life they apply it, then it is okay for the wording to choose alternatives to strict mathematical rigor. The important point is that it does no harm to the majority, while providing them with measurable value. In its current form, the wording is more tailored to mathematicians and as a result is impenetrable to normal people. As such, insistence on strict rigor ends up harming the large majority of visitors to the page, in order to satisfy the sensibilities of the (few) mathematicians who visit. Although legitimate, and I feel their pain, that pain of the few is measured against the pain of the page being impenetrable to the majority. That harm should be avoided.
However, I do see your point. In the following, I have added some disclaimer words that make it clear that the wording is not meant to be mathematically rigorous, but rather to be valuable to the large majority of readers. The concepts are sound, and useful, and the disclaimers alert the reader to look further for full rigor if they so desire. In the end, no harm is done, but high value is gained, by many.
"There are two distinct concepts that are both called "variance". One variance is a characteristic of a set of observations. The other is part of a theoretical probability distribution and is defined by an equation. When variance is calculated from observations, those observations are either measured from a real world system or generated by a theoretical probability distribution or other generating model. If all possible observations of the system are present then the calculated variance is called the population variance. Normally, however, only a subset is available, and the variance calculated from this is called the sample variance. The variance calculated from a sample is considered an estimate of the full population variance. There are multiple ways to calculate an estimate of the population variance, as discussed in the section below.
The two kinds of variance are closely related. To see how, consider that a theoretical probability distribution can be used as a generator of observations. If an infinite number of observations are generated using a distribution, then the sample variance of that infinite set will match the distribution's equation derived variance."

Seanhalle (talk) 16:56, 16 September 2015 (UTC)
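
The relationship described in the proposed wording can be illustrated with a small simulation. This is only a sketch under stated assumptions: the fair six-sided die stands in for the "theoretical probability distribution", Python's pseudo-random generator stands in for the "generator of observations", and the function names and the seed are arbitrary choices made here. A finite simulation can only suggest the limiting behaviour; it does not prove it.

```python
import random


def unbiased_sample_variance(xs):
    """Sample variance with the n - 1 divisor."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def simulate(sizes, seed=0):
    """Roll a fair die n times for each n in `sizes`; return {n: sample variance}."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = {}
    for n in sizes:
        rolls = [rng.randint(1, 6) for _ in range(n)]
        results[n] = unbiased_sample_variance(rolls)
    return results


DIE_VARIANCE = 35 / 12  # variance of the fair-die distribution, about 2.9167

results = simulate([10, 1_000, 200_000])
for n, s2 in results.items():
    print(n, round(s2, 4), "vs", round(DIE_VARIANCE, 4))
```

With small samples the sample variance can land well away from 35/12. As the number of rolls grows, it settles closer to the value given by the distribution's equation.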

99.7263% (based on a sample size of one) visitors to this page are looking for an explanation of and equations to estimate the sample variance. The remaining visitors who are looking for a presentation of population variance have no need of this article. Please take this into consideration. Doug (talk) 19:09, 13 November 2015 (UTC)

Should in fact be 99.7264% Nijdam (talk) 10:58, 16 November 2015 (UTC)